OPTIMAL AFFINE INVARIANT SMOOTH MINIMIZATION ALGORITHMS


ALEXANDRE D'ASPREMONT, CRISTÓBAL GUZMÁN, AND MARTIN JAGGI

ABSTRACT. We formulate an affine invariant implementation of the accelerated first-order algorithm in [Nesterov, 1983]. Its complexity bound is proportional to an affine invariant regularity constant defined with respect to the Minkowski gauge of the feasible set. We extend these results to more general problems, optimizing Hölder smooth functions using $p$-uniformly convex prox terms, and derive an algorithm whose complexity better fits the geometry of the feasible set and adapts to both the best Hölder smoothness parameter and the best gradient Lipschitz constant. Finally, we detail matching complexity lower bounds when the feasible set is an $\ell_p$ ball. In this setting, our upper bounds on iteration complexity for the algorithm in [Nesterov, 1983] are thus optimal in terms of target precision, smoothness and problem dimension.

1. INTRODUCTION

Here, we show how to implement the smooth minimization algorithm described in [Nesterov, 1983, 2005] so that both its iterations and its complexity bound are invariant with respect to a change of coordinates in the problem. We focus on a generic convex minimization problem written

$$\text{minimize } f(x) \quad \text{subject to } x \in Q, \qquad (1)$$

where $f$ is a convex function with Lipschitz continuous gradient and $Q$ is a compact convex set. Without too much loss of generality, we will assume that the interior of $Q$ is nonempty and contains zero. When $Q$ is sufficiently simple, in a sense that will be made precise later, Nesterov [1983] showed that this problem can be solved with a complexity $O(1/\sqrt{\varepsilon})$, where $\varepsilon$ is the target precision. Furthermore, it can be shown that this complexity bound is optimal in $\varepsilon$ for the class of smooth problems [Nesterov, 2003].

While the dependence in $1/\sqrt{\varepsilon}$ of the complexity bound in Nesterov [1983] is optimal in $\varepsilon$, the various factors in front of that bound contain parameters which can heavily vary with implementation, i.e. the choice of norm and prox regularization function. In fact, the full upper bound on the iteration complexity of the optimal algorithm in [Nesterov, 2003] is written

$$\sqrt{\frac{8 L\, d(x^\star)}{\sigma\, \varepsilon}}$$

where $L$ is the Lipschitz constant of the gradient, $d(x^\star)$ the value of the prox at the optimum and $\sigma$ its strong convexity parameter, all varying with the choice of norm and prox. This means in particular that, everything else being equal, this bound is not invariant with respect to an affine change of coordinates. Arguably then, the complexity bound varies while the intrinsic complexity of problem (1) remains unchanged. Optimality in $\varepsilon$ is thus no guarantee of computational efficiency, and a poorly parameterized optimal method can exhibit far from optimal numerical performance. On the other hand, optimal choices of norm and prox, hence of $L$ and $d$, should produce affine invariant bounds. Hence, affine invariance, besides its implications in terms of numerical stability, can also act as a guide to optimally choose norm and prox.

Here, we show how to choose an underlying norm and a prox term for the algorithm in [Nesterov, 1983, 2005] which make its iterations and complexity bound invariant by a change of coordinates. In Section 3, we construct the norm as the Minkowski gauge of centrally symmetric sets $Q$, then derive the prox using a definition of the regularity of Banach spaces used by [Juditsky and Nemirovski, 2008] to derive concentration

1. Part of this work was done during a postdoctoral position at Centrum Wiskunde & Informatica. Date: November 28.

inequalities. These systematic choices allow us to derive an affine invariant bound on the complexity of the algorithm in [Nesterov, 1983]. When $Q$ is an $\ell_p$ ball, we show that this complexity bound is optimal for most, but not all, high dimensional regimes in $(n, \varepsilon)$.

In Section 4, we thus extend our results to much more general problems, deriving a new algorithm to optimize Hölder smooth functions using $p$-uniformly convex prox functions. This extends the results of [Nemirovskii and Nesterov, 1985] by incorporating adaptivity to the Hölder continuity of the gradient, and those of [Nesterov, 2015] by allowing general uniformly convex prox functions, not just strongly convex ones. These additional degrees of freedom allow us to match the optimal complexity lower bounds derived in Section 5 from [Guzmán and Nemirovski, 2015] when optimizing on $\ell_p$ balls, with adaptivity in the Hölder smoothness parameter and Lipschitz constant as a bonus. This means that, on $\ell_p$-balls at least, our complexity bounds are optimal not only in terms of target precision $\varepsilon$, but also in terms of smoothness and problem dimension. This shows that, in the $\ell_p$ setting at least, affine invariance does indeed lead to optimal complexity.

2. SMOOTH OPTIMIZATION ALGORITHM

We first recall the basic structure of the algorithm in [Nesterov, 1983]. While many variants of this method have been derived, we use the formulation in [Nesterov, 2005]. We choose a norm $\|\cdot\|$ and assume that the function $f$ in problem (1) is convex with Lipschitz continuous gradient, so

$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|^2, \quad x, y \in Q, \qquad (2)$$

for some $L > 0$. We also choose a prox function $d(x)$ for the set $Q$, i.e. a continuous, strongly convex function on $Q$ with parameter $\sigma$ (see Nesterov [2003] or Hiriart-Urruty and Lemaréchal [1993] for a discussion of regularization techniques using strongly convex functions). We let $x_0$ be the center of $Q$ for the prox-function $d(x)$, so that

$$x_0 \triangleq \mathop{\rm argmin}_{x \in Q}\, d(x);$$

assuming w.l.o.g. that $d(x_0) = 0$, we then get in particular

$$d(x) \ge \tfrac{\sigma}{2}\|x - x_0\|^2. \qquad (3)$$

We write $T_Q(x)$ a solution to the following subproblem

$$T_Q(x) \triangleq \mathop{\rm argmin}_{y \in Q}\Big\{ \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|^2 \Big\}. \qquad (4)$$

We let $y_0 \triangleq T_Q(x_0)$ where $x_0$ is defined above. We recursively define three sequences of points: the current iterate $x_t$, the corresponding $y_t = T_Q(x_t)$, and the points

$$z_t \triangleq \mathop{\rm argmin}_{x \in Q}\Big\{ \tfrac{L}{\sigma}\, d(x) + \sum_{i=0}^{t} \alpha_i \big[ f(x_i) + \langle \nabla f(x_i), x - x_i\rangle \big] \Big\} \qquad (5)$$

given a step size sequence $\alpha_k \ge 0$ with $\alpha_0 \in (0, 1]$, so that

$$x_{t+1} = \tau_t z_t + (1 - \tau_t) y_t, \qquad y_{t+1} = T_Q(x_{t+1}), \qquad (6)$$

where $\tau_t = \alpha_{t+1}/A_{t+1}$ with $A_t = \sum_{i=0}^{t} \alpha_i$. We implicitly assume here that $Q$ is simple enough so that the two subproblems defining $y_t$ and $z_t$ can be solved very efficiently. We have the following convergence result.

Theorem 2.1 (Nesterov [2005]). Suppose $\alpha_t = (t+1)/2$, with the iterates $x_t$, $y_t$ and $z_t$ defined in (5) and (6); then for any $t \ge 0$ we have

$$f(y_t) - f(x^\star) \le \frac{4 L\, d(x^\star)}{\sigma\,(t+1)^2},$$

where $x^\star$ is an optimal solution to problem (1).

If $\varepsilon > 0$ is the target precision, Theorem 2.1 ensures that Algorithm 1 will converge to an $\varepsilon$-accurate solution in no more than

$$\sqrt{\frac{8 L\, d(x^\star)}{\sigma\,\varepsilon}} \qquad (7)$$

iterations. In practice of course, $d(x^\star)$ needs to be bounded a priori, and $L$ and $\sigma$ are often hard to evaluate.

Algorithm 1 Smooth minimization.
Input: $x_0$, the prox center of the set $Q$.
1: for $t = 0, \ldots, T$ do
2:   Compute $\nabla f(x_t)$.
3:   Compute $y_t = T_Q(x_t)$.
4:   Compute $z_t = \mathop{\rm argmin}_{x \in Q}\big\{ \frac{L}{\sigma} d(x) + \sum_{i=0}^{t} \alpha_i [f(x_i) + \langle \nabla f(x_i), x - x_i\rangle] \big\}$.
5:   Set $x_{t+1} = \tau_t z_t + (1 - \tau_t) y_t$.
6: end for
Output: $x_T, y_T \in Q$.

While most of the parameters in Algorithm 1 are set explicitly, the norm $\|\cdot\|$ and the prox function $d(x)$ are chosen arbitrarily. In what follows, we will see that a natural choice for both makes the algorithm affine invariant.

3. AFFINE INVARIANT IMPLEMENTATION

We can define an affine change of coordinates $x = Ay$, where $A \in \mathbb{R}^{n \times n}$ is a nonsingular matrix, for which the original optimization problem in (1), written

$$\text{minimize } f(x) \quad \text{subject to } x \in Q, \qquad (8)$$

becomes

$$\text{minimize } \hat f(y) \quad \text{subject to } y \in \hat Q, \qquad (9)$$

in the variable $y \in \mathbb{R}^n$, where $\hat f(y) \triangleq f(Ay)$ and $\hat Q \triangleq A^{-1} Q$. Unless $A$ is pathologically ill-conditioned, both problems are equivalent and should have identical complexity bounds and iterations. In fact, the complexity analysis of Newton's method based on the self-concordance argument developed in [Nesterov and Nemirovskii, 1994] produces affine invariant complexity bounds, and the iterates themselves are invariant. Here we will show how to choose the norm $\|\cdot\|$ and the prox function $d(x)$ to get a similar behavior for Algorithm 1.

3.1. Choosing the Norm. We start with a few classical results and definitions. Recall that the Minkowski gauge of a set $Q$ is defined as follows.

Definition 3.1. Given $Q \subset \mathbb{R}^n$ containing zero, we define the Minkowski gauge of $Q$ as

$$\gamma_Q(x) \triangleq \inf\{\lambda \ge 0 : x \in \lambda Q\}$$

with $\gamma_Q(x) = 0$ when $Q$ is unbounded in the direction $x$.
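The gauge can be evaluated directly from the membership test in this definition. As a quick numerical illustration (our own, not from the paper), the sketch below computes $\gamma_Q$ by bisection for $Q$ the unit $\ell_p$ ball, where the gauge has the closed form $\|x\|_p$; the same bisection works for any convex $Q$ with a cheap membership oracle. Membership tests also make affine invariance evident: $x \in \lambda Q$ iff $A^{-1}x \in \lambda A^{-1}Q$, so $\gamma_{A^{-1}Q}(A^{-1}x) = \gamma_Q(x)$.

```python
import numpy as np

def gauge(x, member, hi=1e6, tol=1e-10):
    """Minkowski gauge inf{lam >= 0 : x in lam * Q} by bisection, given a
    membership oracle member(x, lam) answering 'x in lam * Q'."""
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if member(x, mid) else (mid, hi)
    return hi

# Q = B_3, the unit l_3 ball, whose gauge is the l_3 norm
in_b3 = lambda x, lam: np.sum(np.abs(x / lam) ** 3.0) <= 1.0
x = np.array([0.5, -1.0, 2.0])
print(gauge(x, in_b3), np.sum(np.abs(x) ** 3.0) ** (1 / 3.0))  # agree
```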

When $Q$ is a compact convex set, centrally symmetric with respect to the origin and with nonempty interior, the Minkowski gauge defines a norm. We write this norm $\|\cdot\|_Q \triangleq \gamma_Q(\cdot)$. From now on, we will assume that the set $Q$ is centrally symmetric, or use for example $\hat Q = \frac{1}{2}(Q - Q)$ (in the Minkowski sense) for the gauge when it is not (this can be improved, and extending these results to the nonsymmetric case is a classical topic in functional analysis). Note that any linear transform of a centrally symmetric convex set remains centrally symmetric. The following simple result shows why $\|\cdot\|_Q$ is potentially a good choice of norm for Algorithm 1.

Lemma 3.2. Suppose $f : \mathbb{R}^n \to \mathbb{R}$, $Q$ is a centrally symmetric convex set with nonempty interior, and let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then $f$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_Q$ with constant $L > 0$, i.e.

$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|_Q^2, \quad x, y \in Q,$$

if and only if the function $\hat f(w) \triangleq f(Aw)$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with the same constant $L$.

Proof. Let $z, w \in A^{-1}Q$, with $y = Az$ and $x = Aw$; then

$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{L}{2}\|y - x\|_Q^2, \quad x, y \in Q,$$

is equivalent to

$$f(Az) \le f(Aw) + \langle A^{-T}\nabla_w \hat f(w), Az - Aw\rangle + \tfrac{L}{2}\|Az - Aw\|_Q^2, \quad z, w \in A^{-1}Q,$$

and, using the fact that $\|Aw\|_Q = \|w\|_{A^{-1}Q}$, this is also

$$f(Az) \le f(Aw) + \langle \nabla_w \hat f(w), z - w\rangle + \tfrac{L}{2}\|z - w\|_{A^{-1}Q}^2, \quad z, w \in A^{-1}Q,$$

hence the desired result.

An almost identical argument shows the following analogous result for the property of strong convexity with respect to the norm $\|\cdot\|_Q$ under affine changes of coordinates. However, starting from the above Lemma 3.2, this can also be seen as a consequence of the well-known duality between smoothness and strong convexity (see e.g. [Hiriart-Urruty and Lemaréchal, 1993, Chap. X, Theorem 4.2.1]).

Theorem 3.3. Let $f : Q \to \mathbb{R}$ be a convex l.s.c. function. Then $f$ is strongly convex w.r.t. the norm $\|\cdot\|$ with constant $\mu > 0$ if and only if its conjugate $f^*$ has Lipschitz continuous gradient w.r.t. the dual norm $\|\cdot\|^*$ with constant $L = 1/\mu$.

From the previous two results, we immediately have the following lemma.

Lemma 3.4. Suppose $f : \mathbb{R}^n \to \mathbb{R}$, $Q$ is a centrally symmetric convex set with nonempty interior, and let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then $f$ is strongly convex with respect to the norm $\|\cdot\|_Q$ with parameter $\sigma > 0$, i.e.

$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{\sigma}{2}\|y - x\|_Q^2, \quad x, y \in Q,$$

if and only if the function $\hat f(w) \triangleq f(Aw)$ is strongly convex with respect to the norm $\|\cdot\|_{A^{-1}Q}$ with the same parameter.

We now turn our attention to the choice of prox function in Algorithm 1.
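As a concrete reference point before specializing the norm and prox, here is a minimal sketch (ours) of Algorithm 1 in the plain Euclidean setup, where $Q$ is a Euclidean ball around the prox center, $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ and $\sigma = 1$, so that both subproblems (4) and (5) reduce to projections. The names and the `grad_f`, `L` inputs are our assumptions; this is a sketch of the recursion, not the affine invariant implementation discussed below.

```python
import numpy as np

def nesterov_smooth(grad_f, L, x0, radius, T):
    """Sketch of Algorithm 1 for Q = {x : ||x - x0||_2 <= radius} with
    prox d(x) = ||x - x0||_2^2 / 2 (sigma = 1)."""
    def proj(v):                           # Euclidean projection onto Q
        d = v - x0
        nrm = np.linalg.norm(d)
        return v if nrm <= radius else x0 + (radius / nrm) * d
    x, z, y = x0.copy(), x0.copy(), x0.copy()
    g_acc = np.zeros_like(x0)              # running sum of alpha_i * grad f(x_i)
    A = 0.0
    for t in range(T):
        alpha = (t + 1) / 2.0              # step sizes of Theorem 2.1
        A += alpha
        g = grad_f(x)
        y = proj(x - g / L)                # y_t = T_Q(x_t), from (4)
        g_acc += alpha * g
        z = proj(x0 - g_acc / L)           # z_t from (5), in closed form
        tau = ((t + 2) / 2.0) / (A + (t + 2) / 2.0)   # alpha_{t+1}/A_{t+1}
        x = tau * z + (1 - tau) * y
    return y

# usage: minimize ||x - c||^2/2 over the unit ball; solution is c/||c||
c = np.array([2.0, 0.0, 1.0])
sol = nesterov_smooth(lambda x: x - c, L=1.0, x0=np.zeros(3), radius=1.0, T=200)
```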

3.2. Choosing the Prox. Choosing the norm as $\|\cdot\|_Q$ allows us to define a norm without introducing an arbitrary geometry in the algorithm, since the norm is extracted directly from the problem definition. Notice furthermore that, by Theorem 3.3, when the squared dual norm $(\|\cdot\|_Q^*)^2$ is smooth, we can set $d(x) = \|x\|_Q^2/2$. The immediate impact of this choice is that the term $d(x^\star)$ in (7) is bounded by one, by construction. This choice has other natural benefits which are highlighted below. We first recall a result showing that the conjugate of a squared norm is the squared dual norm.

Lemma 3.5. Let $\|\cdot\|$ be a norm and $\|\cdot\|^*$ its dual norm; then

$$\tfrac{1}{2}(\|y\|^*)^2 = \sup_x\; y^T x - \tfrac{1}{2}\|x\|^2.$$

Proof. We recall the proof in [Boyd and Vandenberghe, 2004, Example 3.27] as it will prove useful in what follows. By definition, $x^T y \le \|y\|^*\|x\|$, hence

$$y^T x - \tfrac{1}{2}\|x\|^2 \le \|y\|^*\|x\| - \tfrac{1}{2}\|x\|^2 \le \tfrac{1}{2}(\|y\|^*)^2$$

because the middle term is a quadratic function of $\|x\|$, with maximum $(\|y\|^*)^2/2$. This maximum is attained by any $x$ such that $x^T y = \|y\|^*\|x\|$ (there must be one by construction of the dual norm), normalized so $\|x\| = \|y\|^*$, which yields the desired result.

Computing the prox-mapping in (4) amounts to taking the conjugate of $\|\cdot\|^2$, so this last result (and its proof) shows that, in the unconstrained case, solving the prox mapping is equivalent to finding a vector aligned with the gradient, with respect to the Minkowski norm $\|\cdot\|_Q$. We now recall another simple result showing that the dual of the norm $\|\cdot\|_Q$ is given by $\|\cdot\|_{Q^\circ}$, where $Q^\circ$ is the polar of the set $Q$.

Lemma 3.6. Let $Q$ be a centrally symmetric convex set with nonempty interior; then $\|\cdot\|_Q^* = \|\cdot\|_{Q^\circ}$.

Proof. We write

$$\|x\|_{Q^\circ} = \inf\{\lambda \ge 0 : x \in \lambda Q^\circ\} = \inf\{\lambda \ge 0 : x^T y \le \lambda, \text{ for all } y \in Q\} = \inf\Big\{\lambda \ge 0 : \sup_{y \in Q} x^T y \le \lambda\Big\} = \sup_{y \in Q}\, x^T y = \|x\|_Q^*,$$

which is the desired result.

In the light of the results above, we conclude that whenever $\|\cdot\|_Q^2$ is smooth we obtain a natural prox function $d(x) = \|x\|_Q^2/2$, whose strong convexity parameter is controlled by the Lipschitz constant of the gradient of the squared dual norm $\|\cdot\|_{Q^\circ}^2/2$. However, this does not cover the case where the squared norm $\|\cdot\|_Q^2$ is not strongly convex. In that scenario, we need to pick the norm based on $Q$, but find a strongly convex prox function not too different from $\|\cdot\|_Q^2$. This is exactly the dual of the problem studied by Juditsky and Nemirovski [2008], who worked on concentration inequalities for vector-valued martingales and defined the regularity of a Banach space $(E, \|\cdot\|_E)$ in terms of the smoothness of the best smooth approximation of the norm $\|\cdot\|_E$. We first recall a few more definitions, and we will then show that the regularity constant defined by Juditsky and Nemirovski [2008] produces an affine invariant bound on the term $d(x^\star)/\sigma$ in the complexity of the smooth algorithm in [Nesterov, 1983].

Definition 3.7. Suppose $\|\cdot\|_X$ and $\|\cdot\|_Y$ are two norms on a space $E$; the distortion $d(\|\cdot\|_X, \|\cdot\|_Y)$ between these two norms is equal to the smallest product $ab > 0$ such that

$$\tfrac{1}{b}\|x\|_Y \le \|x\|_X \le a\|x\|_Y$$

over all $x \in E$.

Note that $\log d(\|\cdot\|_X, \|\cdot\|_Y)$ defines a metric on the set of all symmetric convex bodies in $\mathbb{R}^n$, called the Banach-Mazur distance. We then recall the regularity definition in [Juditsky and Nemirovski, 2008].
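Before doing so, here is a quick numerical check (our own illustration) of Lemma 3.6 in the $\ell_p$ case: the polar of the unit $\ell_p$ ball is the unit $\ell_q$ ball with $1/p + 1/q = 1$, so the dual norm $\sup_{y \in Q} x^T y$ should equal $\|x\|_q$, which the closed-form maximizer from Hölder's equality case makes explicit.

```python
import numpy as np

rng = np.random.default_rng(2)
x, p = rng.standard_normal(6), 1.5
q = p / (p - 1.0)                           # conjugate exponent

# the maximizer of <x, y> over the l_p unit sphere aligns with sign(x)|x|^{q-1}
y = np.sign(x) * np.abs(x) ** (q - 1)
y /= np.sum(np.abs(y) ** p) ** (1 / p)      # normalize so ||y||_p = 1

dual_norm = float(x @ y)                    # sup over Q = B_p, attained here
assert np.isclose(dual_norm, np.sum(np.abs(x) ** q) ** (1 / q))  # = ||x||_q
```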

Definition 3.8. The regularity constant of a Banach space $(E, \|\cdot\|)$ is the smallest constant $\Delta > 0$ for which there exists a smooth norm $\nu(x)$ such that

(i) $\nu(x)^2/2$ has a Lipschitz continuous gradient with constant $\mu$ w.r.t. the norm $\nu(x)$, with $1 \le \mu \le \Delta$;

(ii) the norm $\nu(x)$ satisfies

$$\|x\|^2 \le \nu(x)^2 \le \frac{\Delta}{\mu}\|x\|^2, \quad \text{for all } x \in E, \qquad (10)$$

hence $d(\nu(x), \|x\|)^2 \le \Delta/\mu$.

Note that in finite dimension, since all norms are equivalent to the Euclidean norm with distortion at most $\sqrt{\dim E}$, we know that all finite dimensional Banach spaces are at least $(\dim E)$-regular. Furthermore, the regularity constant is invariant with respect to an affine change of coordinates, since both the distortion and the smoothness bounds are. We are now ready to prove the main result of this section.

Proposition 3.9. Let $\varepsilon > 0$ be the target precision. Suppose that the function $f$ has a Lipschitz continuous gradient with constant $L_Q$ with respect to the norm $\|\cdot\|_Q$, and that the space $(\mathbb{R}^n, \|\cdot\|_{Q^\circ})$ is $\Delta_Q$-regular; then Algorithm 1 will produce an $\varepsilon$-solution to problem (1) in at most

$$\sqrt{\frac{4 L_Q \Delta_Q}{\varepsilon}} \qquad (11)$$

iterations. The constants $L_Q$ and $\Delta_Q$ are affine invariant.

Proof. If $(\mathbb{R}^n, \|\cdot\|_{Q^\circ})$ is $\Delta_Q$-regular, then by Definition 3.8 there exists a norm $\nu(x)$ such that $\nu(x)^2/2$ has a Lipschitz continuous gradient with constant $\mu$ with respect to the norm $\nu(x)$, and [Juditsky and Nemirovski, 2008, Prop. 3.2] shows by conjugacy that the prox function $d(x) \triangleq \nu^*(x)^2/2$ is strongly convex with respect to the norm $\nu^*(x)$ with constant $1/\mu$. Now (10) means that

$$\frac{\mu}{\Delta_Q}\|x\|_Q^2 \le \nu^*(x)^2 \le \|x\|_Q^2, \quad \text{for all } x \in Q,$$

since $\|\cdot\|_{Q^\circ}^* = \|\cdot\|_Q$, hence

$$d(x + y) \ge d(x) + \langle \nabla d(x), y\rangle + \frac{1}{2\mu}\nu^*(y)^2 \ge d(x) + \langle \nabla d(x), y\rangle + \frac{1}{2\Delta_Q}\|y\|_Q^2,$$

so $d(x)$ is strongly convex with respect to $\|\cdot\|_Q$ with constant $\sigma = 1/\Delta_Q$, and, using (10) as above,

$$d(x^\star) = \nu^*(x^\star)^2/2 \le \|x^\star\|_Q^2/2 \le 1/2$$

by definition of $\|\cdot\|_Q$, if $x^\star$ is an optimal (hence feasible) solution of problem (1). The bound in (11) then follows from (7), and its affine invariance follows directly from affine invariance of the distortion, and from Lemmas 3.2 and 3.4.

In Section 5, we will see that, when $Q$ is an $\ell_p$ ball, the complexity bound in (11) is optimal for most, but not all, high dimensional regimes in $(n, \varepsilon)$ where $n$ is greater than $1/\sqrt{\varepsilon}$. In the section that follows, we thus extend Algorithm 1 to much more general problems, optimizing Hölder smooth functions using $p$-uniformly convex prox functions (not just strongly convex ones). These additional degrees of freedom will allow us to match optimal complexity lower bounds, with adaptivity in the Hölder smoothness parameter as a bonus.

4. HÖLDER SMOOTH FUNCTIONS & UNIFORMLY CONVEX PROX

We now extend the results of Section 3 to problems where the objective $f(x)$ is Hölder smooth and the prox function is $p$-uniformly convex, with arbitrary $p \ge 2$. This generalization is necessary to derive optimal complexity bounds for smooth convex optimization over $\ell_p$-balls when $p > 2$, and will require some extensions of the ideas we presented for the standard analysis, which was based on a strongly convex prox. We will consider a slightly different accelerated method, which can be seen as a combination of mirror and gradient steps [Allen-Zhu and Orecchia, 2014]. This variant of the accelerated gradient method is not substantially different however from the one used in the previous section, and its purpose is to make the step-size analysis more transparent. It is worth emphasizing that an interesting byproduct of our method is the analysis of an adaptive step-size policy, which can exploit weaker levels of Hölder continuity of the gradient.

In order to motivate our choice of $p$-uniformly convex prox, we begin with an example highlighting how the difficult geometries of $\ell_p$-spaces when $p > 2$ necessarily lead to weak (dimension-dependent) complexity bounds for any strongly convex prox.

Example 4.1. Let $2 < p \le \infty$ and let $B_p$ be the unit $\ell_p$-ball in $\mathbb{R}^n$. Let $\Psi : \mathbb{R}^n \to \mathbb{R}$ be any strongly convex (prox) function w.r.t. $\|\cdot\|_p$ with constant 1, and suppose w.l.o.g. that $\Psi(0) = 0$. We will prove that

$$\sup_{x \in B_p} \Psi(x) \ge \frac{n^{1-2/p}}{2}.$$

We start from the point $x_0 = 0$ and choose a direction $e_{1+} \in \{e_1, -e_1\}$ so that $\langle \nabla\Psi(0), e_{1+}\rangle \ge 0$. By strong convexity, we have, for $x_1 \triangleq x_0 + e_{1+}/n^{1/p}$,

$$\Psi(x_1) \ge \Psi(x_0) + \frac{\langle \nabla\Psi(0), e_{1+}\rangle}{n^{1/p}} + \frac{1}{2}\|x_1 - x_0\|_p^2 \ge \frac{1}{2 n^{2/p}}.$$

Inductively, we can proceed adding coordinate vectors one by one,

$$x_i \triangleq x_{i-1} + \frac{e_{i+}}{n^{1/p}}, \quad \text{for } i = 1, \ldots, n,$$

where $e_{i+} \in \{e_i, -e_i\}$ is chosen so that $\langle \nabla\Psi(x_{i-1}), e_{i+}\rangle \ge 0$. For this choice we can guarantee

$$\Psi(x_i) \ge \Psi(x_{i-1}) + \frac{\langle \nabla\Psi(x_{i-1}), e_{i+}\rangle}{n^{1/p}} + \frac{1}{2 n^{2/p}} \ge \frac{i}{2 n^{2/p}}.$$

At the end, the vector $x_n$ lies in $B_p$ and $\Psi(x_n) \ge n^{1-2/p}/2$.

4.1. Uniform Convexity and Smoothness. The previous example shows that strong convexity of the prox function is too restrictive when dealing with certain domain geometries, such as $Q = B_p$ when $p > 2$. In order to obtain dimension-independent bounds for these cases, we will have to consider relaxed notions of regularity for the prox, namely $p$-uniform convexity and its dual notion of $q$-uniform smoothness. For simplicity, we will only consider the case of subdifferentiable convex functions, which suffices for our purposes.

Definition 4.2 (Uniform convexity and uniform smoothness). Let $2 \le p < \infty$, $\mu > 0$ and $Q \subseteq \mathbb{R}^n$ a closed convex set. A subdifferentiable function $\Psi : Q \to \mathbb{R}$ is $p$-uniformly convex with constant $\mu$ w.r.t. $\|\cdot\|$ iff for all $x, y \in Q$,

$$\Psi(y) \ge \Psi(x) + \langle \nabla\Psi(x), y - x\rangle + \frac{\mu}{p}\|y - x\|^p. \qquad (12)$$

Now let $1 < q \le 2$ and $L > 0$. A subdifferentiable function $\Phi : Q \to \mathbb{R}$ is $q$-uniformly smooth with constant $L$ w.r.t. $\|\cdot\|$ iff for all $x, y \in Q$,

$$\Phi(y) \le \Phi(x) + \langle \nabla\Phi(x), y - x\rangle + \frac{L}{q}\|y - x\|^q. \qquad (13)$$

From now on, whenever the constant $\mu$ of $p$-uniform convexity is not explicitly stated, $\mu = 1$. We turn our attention to the question of how to obtain an affine invariant prox in the uniformly convex setup. In the previous section it was observed that the regularity constant of the dual space provided such tuning among strongly convex prox functions; however, we are not aware of extensions of this notion to the uniformly smooth setup. Nevertheless, the same purpose can be achieved by directly minimizing the growth factor among the class of uniformly convex functions, which leads to the following notion.
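Before defining it, here is a small numerical illustration (ours) of the construction in Example 4.1 for the particular prox $\Psi(x) = \|x\|_2^2/2$, which is 1-strongly convex w.r.t. $\|\cdot\|_p$ for $p \ge 2$ since $\|\cdot\|_2 \ge \|\cdot\|_p$; the coordinate walk lands exactly on the stated $n^{1-2/p}/2$ value.

```python
import numpy as np

# Example 4.1's walk for Psi(x) = ||x||_2^2 / 2, whose gradient is x:
# adding coordinates with steps n^{-1/p} ends on the l_p sphere with
# Psi(x_n) = n^{1-2/p}/2 exactly.
n, p = 1000, 4.0
x = np.zeros(n)
step = n ** (-1.0 / p)
for i in range(n):
    sign = 1.0 if x[i] >= 0 else -1.0      # ensures <grad Psi(x), e_{i+}> >= 0
    x[i] += sign * step
print(np.sum(np.abs(x) ** p) ** (1 / p))   # = 1.0, so x_n lies in B_p
print(0.5 * x @ x, n ** (1 - 2 / p) / 2)   # both equal n^{1-2/p}/2
```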

Definition 4.3 (Constant of variation). Given a $p$-uniformly convex function $\Psi : \mathbb{R}^n \to \mathbb{R}$, we define its constant of variation on $Q$ as $D_\Psi(Q) \triangleq \sup_{x \in Q}\Psi(x) - \inf_{x \in Q}\Psi(x)$. Furthermore, we define

$$D_{p,Q} \triangleq \inf_\Psi\Big\{ \sup_{x \in Q}\Psi(x) \;:\; \Psi : Q \to \mathbb{R}_+ \text{ is $p$-uniformly convex w.r.t. } \|\cdot\|_Q,\; \Psi(0) = 0 \Big\}. \qquad (14)$$

Some comments are in order. First, for fixed $p$, the constant $D_{p,Q}$ provides the optimal constant of variation among $p$-uniformly convex functions over $Q$, which means that, by construction, $D_{p,Q}$ is affine invariant. Second, Example 4.1 showed that when $2 < p < \infty$, we have $D_{2, B_p} \ge n^{1-2/p}/2$, and the function $\Psi(x) = \|x\|_2^2/2$ shows this bound is tight. We will later see that $D_{p, B_p} = 1$, which is a major improvement for large dimensions. [Juditsky and Nemirovski, 2008, Prop. 3.3] also shows that $D_{2,Q} \le c\,\Delta_Q$, where $\Delta_Q$ is the regularity constant defined in Definition 3.8 and $c > 0$ is an absolute constant, since $\Psi(x)$ is not required to be a norm here.

When $Q$ is the unit ball of a norm, a classical result by Pisier [1975] links the constant of variation in (14) above with the notion of martingale cotype. A Banach space $(E, \|\cdot\|)$ has M-cotype $q$ iff there is some constant $C > 0$ such that for any $T \ge 1$ and any martingale difference sequence $d_1, \ldots, d_T \in E$ we have

$$\Big(\sum_{t=1}^{T} \mathbf{E}\big[\|d_t\|^q\big]\Big)^{1/q} \le C\, \mathbf{E}\Big\|\sum_{t=1}^{T} d_t\Big\|.$$

Pisier [1975] then shows the following result.

Theorem 4.4 (Pisier [1975]). A Banach space $(E, \|\cdot\|)$ has M-cotype $q$ iff there exists a $q$-uniformly convex norm equivalent to $\|\cdot\|$.

In the same spirit, there exists a concrete characterization of a function achieving the optimal constant of variation; see e.g. [Srebro et al., 2011]. Unfortunately, this characterization does not lead to an efficiently computable prox. For the analysis of our accelerated method with uniformly convex prox, we will also need the notion of Bregman divergence.

Definition 4.5 (Bregman divergence). Let $(\mathbb{R}^n, \|\cdot\|)$ be a normed space, and $\Psi : Q \to \mathbb{R}$ be a $p$-uniformly convex function w.r.t. $\|\cdot\|$. We define the Bregman divergence as

$$V_x(y) \triangleq \Psi(y) - \langle \nabla\Psi(x), y - x\rangle - \Psi(x), \quad x \in Q,\; y \in Q.$$

Observe that $V_x(x) = 0$ and $V_x(y) \ge \frac{1}{p}\|y - x\|^p$. For starters, let us prove a simple fact that will be useful in the complexity bounds.

Lemma 4.6. Let $\Psi : Q \to \mathbb{R}$ be a $p$-uniformly convex function, and $V_x(\cdot)$ the corresponding Bregman divergence. Then, for all $x$, $x'$ and $u$ in $Q$,

$$V_x(u) - V_{x'}(u) - V_x(x') = \langle \nabla V_x(x'), u - x'\rangle.$$

Proof. From simple algebra,

$$V_x(u) - V_{x'}(u) - V_x(x') = \Psi(u) - \langle \nabla\Psi(x), u - x\rangle - \Psi(x) - \big[\Psi(u) - \langle \nabla\Psi(x'), u - x'\rangle - \Psi(x')\big] - V_x(x')$$
$$= \langle \nabla\Psi(x') - \nabla\Psi(x), u - x'\rangle + \underbrace{\Psi(x') - \langle \nabla\Psi(x), x' - x\rangle - \Psi(x)}_{= V_x(x')} - V_x(x')$$
$$= \langle \nabla\Psi(x') - \nabla\Psi(x), u - x'\rangle = \langle \nabla V_x(x'), u - x'\rangle,$$

which is the desired result.
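A small numerical check (ours) of Definition 4.5 and Lemma 4.6 for the $p$-uniformly convex choice $\Psi(x) = \|x\|_p^p/p$, whose gradient has the closed form $\mathrm{sign}(x)\,|x|^{p-1}$ coordinatewise:

```python
import numpy as np

p = 4.0
psi  = lambda x: np.sum(np.abs(x) ** p) / p
gpsi = lambda x: np.sign(x) * np.abs(x) ** (p - 1)   # gradient of Psi

def V(x, y):
    """Bregman divergence V_x(y) = Psi(y) - <grad Psi(x), y - x> - Psi(x)."""
    return psi(y) - gpsi(x) @ (y - x) - psi(x)

rng = np.random.default_rng(3)
x, xp, u = rng.standard_normal((3, 5))
lhs = V(x, u) - V(xp, u) - V(x, xp)
rhs = (gpsi(xp) - gpsi(x)) @ (u - xp)     # = <grad V_x(x'), u - x'>
assert np.isclose(lhs, rhs)               # the three-point identity holds
```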

4.2. An Accelerated Method for Minimizing Hölder Smooth Functions. We consider classes of weakly smooth convex functions. For a Hölder exponent $\rho \in (1, 2]$ we denote by $F_\rho(Q, L_\rho)$ the set of convex functions $f : Q \to \mathbb{R}$ such that, for all $x, y \in Q$,

$$\|\nabla f(x) - \nabla f(y)\|^* \le L_\rho\,\|x - y\|^{\rho - 1}.$$

Before describing the method, we first define a step sequence that is useful in the algorithm. For a given $p$ such that $2 \le p < \infty$, consider the sequence $(\gamma_t)_{t \ge 0}$ defined by $\gamma_1 = 1$ and, for any $t \ge 1$, $\gamma_{t+1}$ is the major root of

$$\gamma_{t+1}^p = \gamma_{t+1}^{p-1} + \gamma_t^p.$$

This sequence has the following properties.

Proposition 4.7. The following properties hold for the auxiliary sequence $(\gamma_t)_{t \ge 0}$:
(i) The sequence is increasing.
(ii) $\gamma_t^p = \sum_{s=1}^{t} \gamma_s^{p-1}$.
(iii) $t/p \le \gamma_t \le t$.
(iv) $\sum_{s=1}^{t} \gamma_s^p \le t\,\gamma_t^p$.

Proof. We get:
(i) By definition, $\gamma_{t+1}^p = \gamma_{t+1}^{p-1} + \gamma_t^p \ge \gamma_t^p$, thus $\gamma_{t+1} \ge \gamma_t$.
(ii) By telescoping the recursion, $\gamma_t^p = \sum_{s=1}^{t} \gamma_s^{p-1}$.
(iii) For the lower bound, writing $\gamma_t^p = \gamma_{t+1}^{p-1}(\gamma_{t+1} - 1)$, a Fenchel-type inequality yields

$$\gamma_t = \gamma_{t+1}^{\frac{p-1}{p}}\,[\gamma_{t+1} - 1]^{\frac{1}{p}} \le \frac{p-1}{p}\,\gamma_{t+1} + \frac{\gamma_{t+1} - 1}{p} = \gamma_{t+1} - \frac{1}{p},$$

so the sequence increases by at least $1/p$ at each step. The upper bound is proved by induction as follows:

$$(t+1)^p = (t+1)^{p-1} + t(t+1)^{p-1} > (t+1)^{p-1} + t\big[t^{p-1} + (p-1)t^{p-2}\big] \ge (t+1)^{p-1} + \gamma_t^p,$$

where the last inequality holds by the induction hypothesis $\gamma_t \le t$. As a conclusion, the major root defining $\gamma_{t+1}$ has to be at most $t + 1$.
(iv) By (ii), we have

$$\sum_{s=1}^{t}\gamma_s^p = \sum_{s=1}^{t}\sum_{r=1}^{s}\gamma_r^{p-1} = \sum_{r=1}^{t}(t - r + 1)\,\gamma_r^{p-1} \le t\sum_{r=1}^{t}\gamma_r^{p-1} = t\,\gamma_t^p,$$

which concludes the proof.

We now prove a simple lemma controlling the smoothness of $f$ in terms of $p$. This idea is a minor extension of the inexact gradient trick proposed in [Devolder et al., 2011] and further studied in [Nesterov, 2015], which corresponds to the special case $p = 2$ of the results described here. As in [Devolder et al., 2011], this trick will allow us to minimize Hölder smooth functions by treating their gradient as an inexact oracle on the gradient of a smooth function.

Lemma 4.8. Let $f \in F_\rho(Q, L_\rho)$. Then for any $\delta > 0$ and

$$M \ge \left[\frac{p - \rho}{\rho\, p}\,\frac{1}{\delta}\right]^{\frac{p-\rho}{\rho}} L_\rho^{\frac{p}{\rho}}, \qquad (15)$$

we have that, for all $x, y \in Q$,

$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{M}{p}\|y - x\|^p + \delta.$$

Proof. By assumption on $f$, the following bound holds for any $x, y \in Q$:

$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L_\rho}{\rho}\|y - x\|^\rho.$$

Notice first that it suffices to show that, for all $t \ge 0$,

$$\frac{L_\rho}{\rho}\, t^\rho \le \frac{M}{p}\, t^p + \delta. \qquad (16)$$

This can be seen by letting $t = \|y - x\|$ and using (16) in the preceding inequality. Let us prove (16). First recall the following Fenchel-type (Young) inequality: if $r, s \ge 1$ and $1/r + 1/s = 1$, then for all nonnegative $x$ and $y$ we have $xy \le \frac{1}{r}x^r + \frac{1}{s}y^s$. For $r = p/\rho$, $s = p/(p - \rho)$, applied to the product $(t^\rho/y)\cdot y$, we obtain

$$\frac{L_\rho}{\rho}\, t^\rho \le \frac{1}{p}\,\frac{L_\rho}{y^{p/\rho}}\, t^p + \frac{L_\rho(p - \rho)}{\rho\, p}\, y^{\frac{p}{p-\rho}}.$$

Now we choose $y$ so that $\delta = \frac{L_\rho(p-\rho)}{\rho\, p}\, y^{\frac{p}{p-\rho}}$, which leads to the inequality

$$\frac{L_\rho}{\rho}\, t^\rho \le \frac{1}{p}\,\frac{L_\rho}{y^{p/\rho}}\, t^p + \delta.$$

Finally, by our choice of $M$ we have $M \ge L_\rho/y^{p/\rho}$, proving (16) and therefore the result.

Algorithm 2 Accelerated Method with Bregman Prox
Input: $x_0 \in Q$
1: $y_0 = x_0$, $z_0 = x_0$, and $A_0 = 0$
2: for $t = 0, \ldots, T-1$ do
3:   $\alpha_{t+1} = \gamma_{t+1}^{p-1}/M$
4:   $A_{t+1} = A_t + \alpha_{t+1}$
5:   $\tau_t = \alpha_{t+1}/A_{t+1}$
6:   $x_{t+1} = \tau_t z_t + (1 - \tau_t) y_t$
7:   Obtain $\nabla f(x_{t+1})$ from the oracle, and update

$$y_{t+1} = \mathop{\rm arg\,min}_{y \in Q}\Big\{\frac{M}{p}\|y - x_{t+1}\|^p + \langle \nabla f(x_{t+1}), y - x_{t+1}\rangle\Big\} \qquad (17)$$

$$z_{t+1} = \mathop{\rm arg\,min}_{z \in Q}\Big\{V_{z_t}(z) + \alpha_{t+1}\langle \nabla f(x_{t+1}), z - z_t\rangle\Big\} \qquad (18)$$

8: end for
9: return $y_T$

As we will show below, the accelerated method described in Algorithm 2 extends the $\ell_p$-setting for acceleration first proposed by [Nemirovskii and Nesterov, 1985] to nonsmooth spaces, using Bregman divergences. This gives us more flexibility in the choice of prox function, and allows us in particular to better fit the geometry of the feasible set.

Proposition 4.9. Let $f \in F_\rho(Q, L_\rho)$ and $\Psi : Q \to \mathbb{R}$ be $p$-uniformly convex w.r.t. $\|\cdot\|$. Then for any $\varepsilon > 0$, setting $\delta \triangleq \varepsilon/T$ and $M$ satisfying (15), the accelerated method in Algorithm 2 guarantees an accuracy

$$f(y_T) - f^\star \le \frac{D_\Psi(Q)}{A_T} + \varepsilon$$

after $T$ iterations.
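Before the proof, note that the method is straightforward to sketch once the two subproblems (17) and (18) are available as oracles, since everything else in the listing is explicit. The helper below is our own hedged sketch: `prox_grad_step` and `mirror_step` stand for user-supplied solvers of (17) and (18), whose form depends on $Q$ and $\Psi$, and the sequence $\gamma_t$ is computed by bisection using the bracketing $\gamma_t < \gamma_{t+1} \le \gamma_t + 1$ from the proof of Proposition 4.7 (iii).

```python
import numpy as np

def major_root(prev, p, iters=80):
    """Largest root of g^p - g^{p-1} - prev^p = 0; it lies in (prev, prev+1]."""
    lo, hi = prev, prev + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid ** p - mid ** (p - 1) - prev ** p < 0:
            lo = mid
        else:
            hi = mid
    return hi

def accelerated_bregman(grad_f, M, p, prox_grad_step, mirror_step, x0, T):
    """Sketch of Algorithm 2. Assumed oracles:
       prox_grad_step(x, g)     ~ (17): argmin_y (M/p)||y-x||^p + <g, y-x>
       mirror_step(z, g, alpha) ~ (18): argmin_u V_z(u) + alpha <g, u-z>
    both over Q; M should satisfy (15) for the desired accuracy."""
    gamma = 1.0                        # gamma_1
    y, z, A = x0.copy(), x0.copy(), 0.0
    for t in range(T):
        alpha = gamma ** (p - 1) / M   # alpha_{t+1} = gamma_{t+1}^{p-1}/M
        A += alpha
        tau = alpha / A
        x = tau * z + (1 - tau) * y
        g = grad_f(x)
        y = prox_grad_step(x, g)
        z = mirror_step(z, g, alpha)
        gamma = major_root(gamma, p)   # advance the step-size sequence
    return y
```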

Proof. Let $u \in Q$ be an arbitrary vector. Using the optimality conditions for subproblem (18) together with Lemma 4.6, we get

$$\alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - u\rangle = \alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - z_{t+1}\rangle + \alpha_{t+1}\langle \nabla f(x_{t+1}), z_{t+1} - u\rangle$$
$$\le \alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - z_{t+1}\rangle - \langle \nabla V_{z_t}(z_{t+1}), z_{t+1} - u\rangle$$
$$= \alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - z_{t+1}\rangle + V_{z_t}(u) - V_{z_{t+1}}(u) - V_{z_t}(z_{t+1})$$
$$\le \Big[\alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - z_{t+1}\rangle - \frac{1}{p}\|z_t - z_{t+1}\|^p\Big] + V_{z_t}(u) - V_{z_{t+1}}(u).$$

Let us examine the term in brackets closely. For this, let $v \triangleq \tau_t z_{t+1} + (1 - \tau_t) y_t$ and note that $x_{t+1} - v = \tau_t(z_t - z_{t+1})$. With $\tau_t = \alpha_{t+1}/A_{t+1}$ we also have, using Proposition 4.7 (ii),

$$\frac{1}{\tau_t^p} = \gamma_{t+1}^p = M A_{t+1},$$

since $\tau_t = \gamma_{t+1}^{p-1}/\gamma_{t+1}^p = 1/\gamma_{t+1}$. From this we obtain

$$\alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - z_{t+1}\rangle - \frac{1}{p}\|z_t - z_{t+1}\|^p = \frac{\alpha_{t+1}}{\tau_t}\langle \nabla f(x_{t+1}), x_{t+1} - v\rangle - \frac{1}{p\,\tau_t^p}\|x_{t+1} - v\|^p$$
$$= A_{t+1}\Big[\langle \nabla f(x_{t+1}), x_{t+1} - v\rangle - \frac{M}{p}\|x_{t+1} - v\|^p\Big]$$
$$\le A_{t+1}\Big[\langle \nabla f(x_{t+1}), x_{t+1} - y_{t+1}\rangle - \frac{M}{p}\|x_{t+1} - y_{t+1}\|^p\Big]$$
$$\le A_{t+1}\big[f(x_{t+1}) - f(y_{t+1}) + \delta\big],$$

where the first inequality holds by the definition of $y_{t+1}$, and the last inequality holds by Lemma 4.8 and the choice of $M$. This means that

$$\alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - u\rangle \le A_{t+1}\big[f(x_{t+1}) - f(y_{t+1}) + \delta\big] + V_{z_t}(u) - V_{z_{t+1}}(u). \qquad (19)$$

From (19) and other simple estimates,

$$\alpha_{t+1}\big[f(x_{t+1}) - f(u)\big] \le \alpha_{t+1}\langle \nabla f(x_{t+1}), x_{t+1} - u\rangle$$
$$= \alpha_{t+1}\langle \nabla f(x_{t+1}), x_{t+1} - z_t\rangle + \alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - u\rangle$$
$$= \frac{(1 - \tau_t)\alpha_{t+1}}{\tau_t}\langle \nabla f(x_{t+1}), y_t - x_{t+1}\rangle + \alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - u\rangle$$
$$\le \frac{(1 - \tau_t)\alpha_{t+1}}{\tau_t}\big[f(y_t) - f(x_{t+1})\big] + \alpha_{t+1}\langle \nabla f(x_{t+1}), z_t - u\rangle$$
$$\le (A_{t+1} - \alpha_{t+1})\big[f(y_t) - f(x_{t+1})\big] + A_{t+1}\big[f(x_{t+1}) - f(y_{t+1}) + \delta\big] + V_{z_t}(u) - V_{z_{t+1}}(u).$$

Therefore

$$A_{t+1} f(y_{t+1}) + V_{z_{t+1}}(u) - V_{z_t}(u) \le A_t f(y_t) + \alpha_{t+1} f(u) + A_{t+1}\delta.$$

Summing these inequalities, we obtain

$$A_T f(y_T) + V_{z_T}(u) - V_{z_0}(u) \le A_T f(u) + \sum_{t=1}^{T} A_t\,\delta.$$

Now, by Proposition 4.7 (iv), we have $\frac{1}{A_T}\sum_{t=1}^{T} A_t = \frac{1}{\gamma_T^p}\sum_{t=1}^{T}\gamma_t^p \le T$; thus, by the choice $\delta = \varepsilon/T$ and since $V_{z_T}(u) \ge 0$, we obtain

$$f(y_T) - f(u) \le \frac{V_{z_0}(u)}{A_T} + \varepsilon.$$

Definition 4.5, together with the fact that $\langle \nabla\Psi(x), y - x\rangle \ge 0$ when $x$ minimizes $\Psi(x)$ over $Q$, then yields the desired result.

In order to obtain the convergence rate of the method, we need to estimate the value of $A_T$ given the choice of $M$. For this we assume the bound in (15) is satisfied with equality. Since $A_T = \gamma_T^p/M$, we can use Proposition 4.7 (iii) and $\delta = \varepsilon/T$, so that

$$A_T = \frac{\gamma_T^p}{M} \ge \Big(\frac{T}{p}\Big)^p\left[\frac{\rho\, p}{p - \rho}\,\frac{\varepsilon}{T}\right]^{\frac{p-\rho}{\rho}} L_\rho^{-\frac{p}{\rho}}.$$

Notice that to obtain an $\varepsilon$-solution it suffices to have $A_T \ge D_\Psi(Q)/\varepsilon$. By imposing this lower bound on the bound obtained for $A_T$, we get the following complexity estimate.

Corollary 4.10. Let $f \in F_\rho(Q, L_\rho)$ and $\Psi : Q \to \mathbb{R}$ be $p$-uniformly convex w.r.t. $\|\cdot\|$. Setting $\delta \triangleq \varepsilon/T$ and $M$ satisfying (15) with equality, the accelerated method in Algorithm 2 requires at most

$$T = \left[p^p\left(\frac{p-\rho}{\rho\, p}\right)^{\frac{p-\rho}{\rho}}\frac{D_\Psi(Q)\, L_\rho^{\frac{p}{\rho}}}{\varepsilon^{\frac{p}{\rho}}}\right]^{\frac{\rho}{\rho(p+1)-p}} + 1$$

iterations to reach an accuracy $\varepsilon$.

We will later see that the algorithm above leads to optimal complexity bounds (that is, unimprovable up to constant factors) for $\ell_p$-setups. However, our algorithm is highly sensitive to several parameters, the most important being $\rho$ (the smoothness) and $L_\rho$, which sets the step size. We now focus on designing an adaptive step-size policy which does not require $L_\rho$ as input, and which adapts itself to the best weak smoothness parameter $\rho \in (1, 2]$.

4.3. An Adaptive Gradient Method. We will now extend the adaptive algorithm in [Nesterov, 2015, Th. 3] to handle $p$-uniformly convex prox functions using Bregman divergences. This new method with adaptive step-size policy is described as Algorithm 3. From line 5 in Algorithm 3 we get the following identities:

$$A_{t+1}^{p-1} = \alpha_{t+1}^p\, M \qquad (21)$$

$$\frac{1}{\tau_t^p} = M A_{t+1}. \qquad (22)$$

These identities are analogous to the ones derived for the non-adaptive variant. For this reason, the analysis of the adaptive variant is almost identical. There are a few extra details to address, which is what we do now.

First, we need to show that the line-search procedure is feasible, that is, it always terminates in finite time. This is intuitively true from Lemma 4.8, but let us make this intuition precise. From (21) and (22) we have

$$M\tau_t^{p-1} = M\,\frac{\alpha_{t+1}^{p-1}}{A_{t+1}^{p-1}} = \frac{\alpha_{t+1}^{p-1}}{\alpha_{t+1}^{p}} = \frac{1}{\alpha_{t+1}}.$$

Notice that whenever the condition (20) of Algorithm 3 is not satisfied, $M$ is increased by a factor two. Suppose the line search does not terminate; then $\alpha \to 0$. However, by Lemma 4.8, the termination condition (20) is guaranteed to be satisfied as soon as

$$M \ge \left[\frac{p-\rho}{\rho\, p}\,\frac{1}{\varepsilon\tau}\right]^{\frac{p-\rho}{\rho}} L_\rho^{\frac{p}{\rho}},$$

which is a contradiction with $\alpha \to 0$.

To produce convergence rates, we need a lower bound on the sequence $A_t$. Unfortunately, the analysis in [Nesterov, 2015] only works when $p = 2$; we will thus use a different argument. First, notice that by the line-search rule

$$M_{t+1} \le 2\left[\frac{p-\rho}{\rho\, p}\,\frac{1}{\varepsilon\tau_t}\right]^{\frac{p-\rho}{\rho}} L_\rho^{\frac{p}{\rho}}.$$

This allows us to conclude, using $\tau_t = \alpha_{t+1}/A_{t+1}$ and the identities above,

$$\alpha_{t+1} = \tau_t A_{t+1} = A_{t+1}^{\frac{p-1}{p}}\, M_{t+1}^{-\frac{1}{p}} \ge A_{t+1}^{\frac{p-1}{p}}\left[2\left(\frac{p-\rho}{\rho\, p}\right)^{\frac{p-\rho}{\rho}}\left(\frac{A_{t+1}}{\varepsilon\,\alpha_{t+1}}\right)^{\frac{p-\rho}{\rho}} L_\rho^{\frac{p}{\rho}}\right]^{-\frac{1}{p}},$$

Algorithm 3 Accelerated Method with Bregman Prox and Adaptive Stepsize
Input: $x_0 \in Q$
1: Set $y_0 = x_0$, $z_0 = x_0$, $M_0 = 1$ and $A_0 = 0$.
2: for $t = 0, \ldots, T-1$ do
3:   $M = M_t/2$
4:   repeat
5:     Set $M = 2M$, and

$$\alpha = \max\Big\{a : M^{\frac{1}{p-1}}\, a^{\frac{p}{p-1}} - a = A_t\Big\}, \quad A = A_t + \alpha, \quad \tau = \alpha/A, \quad x_{t+1} = \tau z_t + (1 - \tau) y_t$$

6:     Obtain $\nabla f(x_{t+1})$, and compute

$$y_{t+1} = \mathop{\rm arg\,min}_{y \in Q}\Big\{\frac{M}{p}\|y - x_{t+1}\|^p + \langle \nabla f(x_{t+1}), y - x_{t+1}\rangle\Big\}$$

7:   until

$$f(y_{t+1}) \le f(x_{t+1}) + \langle \nabla f(x_{t+1}), y_{t+1} - x_{t+1}\rangle + \frac{M}{p}\|y_{t+1} - x_{t+1}\|^p + \tau\varepsilon \qquad (20)$$

8:   Set $M_{t+1} = M/2$, $\alpha_{t+1} = \alpha$, $A_{t+1} = A$, $\tau_t = \tau$.
9:   Compute

$$z_{t+1} = \mathop{\rm arg\,min}_{z \in Q}\Big\{V_{z_t}(z) + \alpha_{t+1}\langle \nabla f(x_{t+1}), z - z_t\rangle\Big\}$$

10: end for
11: return $y_T$
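The only nonstandard computations in this listing are the scalar equation of line 5, which enforces identity (21), and the doubling loop on $M$. Below is our own hedged sketch of one outer iteration: `prox_grad_step(x, g, M)` is again an assumed oracle for the subproblem of line 6, the norm is taken to be $\ell_p$ to match the setups of Section 5, and the scalar equation is solved by bisection after bracketing its largest root by doubling.

```python
import numpy as np

def solve_alpha(M, A_t, p, iters=80):
    """Largest root a of M^{1/(p-1)} a^{p/(p-1)} - a = A_t (line 5), which
    enforces A^{p-1} = alpha^p M for A = A_t + alpha, i.e. identity (21)."""
    phi = lambda a: M ** (1.0 / (p - 1)) * a ** (p / (p - 1)) - a
    hi = 1.0
    while phi(hi) < A_t:          # bracket the largest root by doubling
        hi *= 2.0
    lo = 0.0
    for _ in range(iters):        # bisection on the increasing branch of phi
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < A_t else (lo, mid)
    return hi

def adaptive_iteration(f, grad_f, prox_grad_step, y_t, z_t, M_t, A_t, p, eps):
    """One outer iteration of Algorithm 3: double M until (20) holds."""
    M = M_t / 2.0
    while True:
        M *= 2.0
        alpha = solve_alpha(M, A_t, p)
        A, tau = A_t + alpha, alpha / (A_t + alpha)
        x = tau * z_t + (1.0 - tau) * y_t
        g = grad_f(x)
        y = prox_grad_step(x, g, M)
        # termination test (20), with per-step tolerance tau * eps
        if f(y) <= (f(x) + g @ (y - x)
                    + (M / p) * np.sum(np.abs(y - x) ** p) + tau * eps):
            return y, x, g, alpha, A, tau, M / 2.0   # last entry is M_{t+1}
```

Returning $M/2$ implements line 8, so the next iteration restarts its search at $M_t/2$ and reaches the previous $M_t$ after a single doubling.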

Combining the last two bounds gives an inequality involving $\alpha_{t+1}$ and $A_{t+1}$:

$$\alpha_{t+1} \ge \left[\frac{1}{2}\left(\frac{\rho\, p}{p-\rho}\right)^{\frac{p-\rho}{\rho}}\frac{\varepsilon^{\frac{p-\rho}{\rho}}}{L_\rho^{\frac{p}{\rho}}}\right]^{\frac{\rho}{\rho(p+1)-p}} A_{t+1}^{\frac{p(\rho-1)}{\rho(p+1)-p}}.$$

Here is where we need to depart from Nesterov's analysis, as the condition $\gamma \ge 1/2$ used in that proof does not hold. Instead, we show the following bound.

Lemma 4.11. Suppose $\alpha_t \ge 0$, $\alpha_0 = 0$ and $A_t = \sum_{j=0}^{t}\alpha_j$ satisfy

$$\alpha_t \ge \beta A_t^s$$

for some $s \in [0, 1)$ and $\beta \ge 0$. Then, for any $t \ge 0$,

$$A_t \ge \big((1-s)\,\beta\, t\big)^{\frac{1}{1-s}}.$$

Proof. The sequence $A_t$ follows the recursion $A_t - A_{t-1} \ge \beta A_t^s$. The function $h(x) \triangleq x - \beta x^s$ satisfies $h(0) = 0$, $h'(0^+) < 0$, and $h'(x)$ only has a single positive root. Hence, when $A_{t-1} > 0$, the equation $A_t - \beta A_t^s = A_{t-1}$ in the variable $A_t$ only has a single positive root, after which $h(A_t)$ is increasing. This means that to get a lower bound on $A_t$ it suffices to consider the extreme case of the sequence satisfying $A_t - A_{t-1} = \beta A_t^s$. Because $A_t$ is increasing, the sequence $A_t - A_{t-1}$ is increasing, hence there exists an increasing, convex, piecewise affine function $A(t)$ that interpolates $A_t$, whose breakpoints are located at integer values of $t$. By construction, this function satisfies, for any $t \notin \mathbb{N}$,

$$A'(t) = A_{\lceil t\rceil} - A_{\lfloor t\rfloor} = \alpha_{\lceil t\rceil} \ge \beta A_{\lceil t\rceil}^s \ge \beta A(t)^s,$$

so that

$$A'(t) \ge \beta A^s(t) \qquad (23)$$

for any $t \ge 0$. Note that $1/A^s(t)$ has a convergent integral around 0, as $A(t)$ is linear around 0, and $A'(\cdot)$ can be defined as a right continuous nondecreasing function, which is furthermore constant around 0; the functions involved are therefore integrable and the change of variables theorem holds. Integrating the differential inequality, we get

$$\beta t \le \int_0^t \frac{A'(u)}{A^s(u)}\, du = \int_0^{A(t)} \frac{dv}{v^s} = \frac{A(t)^{1-s}}{1-s},$$

yielding the desired result.

Using Lemma 4.11 with $s = p(\rho-1)/(\rho(p+1)-p)$ produces the following bound on $A_T$:

$$A_T \ge \frac{1}{2}\left(\frac{\rho}{\rho(p+1)-p}\right)^{\frac{\rho(p+1)-p}{\rho}}\left(\frac{\rho\, p}{p-\rho}\right)^{\frac{p-\rho}{\rho}}\frac{\varepsilon^{\frac{p-\rho}{\rho}}}{L_\rho^{\frac{p}{\rho}}}\; T^{\frac{\rho(p+1)-p}{\rho}}.$$

To guarantee that $A_T \ge D_\Psi(Q)/\varepsilon$, it suffices to impose

$$T \ge C(p, \rho)\left(\frac{D_\Psi(Q)\, L_\rho^{\frac{p}{\rho}}}{\varepsilon^{\frac{p}{\rho}}}\right)^{\frac{\rho}{\rho(p+1)-p}}, \quad \text{where} \quad C(p, \rho) \triangleq \frac{\rho(p+1)-p}{\rho}\left[2\left(\frac{p-\rho}{\rho\, p}\right)^{\frac{p-\rho}{\rho}}\right]^{\frac{\rho}{\rho(p+1)-p}}.$$

Corollary 4.12. Let $f \in F_\rho(Q, L_\rho)$ for some $\rho \in (1, 2]$, and let $\Psi : Q \to \mathbb{R}$ be $p$-uniformly convex w.r.t. $\|\cdot\|$. Then the number of iterations required by Algorithm 3 to produce a solution with accuracy $\varepsilon$ is bounded by

$$T \le \inf_{1 < \rho \le 2}\; C(p, \rho)\left(\frac{D_\Psi(Q)\, L_\rho^{\frac{p}{\rho}}}{\varepsilon^{\frac{p}{\rho}}}\right)^{\frac{\rho}{\rho(p+1)-p}} + 1.$$

From Corollary 4.12 we obtain the affine invariant bound on iteration complexity. Given a centrally symmetric convex body $Q \subseteq \mathbb{R}^n$, we choose the norm as its Minkowski gauge $\|\cdot\| = \|\cdot\|_Q$, and as $p$-uniformly convex prox the minimizer defining the optimal constant of variation, $\sup_{x \in Q}\Psi(x) = D_{p,Q}$. With these choices, the iteration complexity is

$$T \le \inf_{1 < \rho \le 2}\; C(p, \rho)\left(\frac{D_{p,Q}\, L_{\rho,Q}^{\frac{p}{\rho}}}{\varepsilon^{\frac{p}{\rho}}}\right)^{\frac{\rho}{\rho(p+1)-p}},$$

where $L_{\rho,Q}$ is the Hölder constant of $\nabla f$ quantified in the Minkowski gauge norm $\|\cdot\|_Q$. As a consequence, the bound above is affine invariant, since $D_{p,Q}$ is also affine invariant by construction. Observe that our iteration bound automatically adapts to the best possible weak smoothness parameter $\rho \in (1, 2]$; note however that an implementable algorithm requires an accuracy certificate in order to stop with this adaptive bound. These details are beyond the scope of this paper, but we refer to [Nesterov, 2015] for details. Finally, we will see in what follows that this affine invariant bound also matches corresponding lower bounds when $Q$ is an $\ell_p$ ball.

5. EXPLICIT BOUNDS ON PROBLEMS OVER $\ell_p$ BALLS

5.1. Upper Bounds. To illustrate our results, first consider the problem of minimizing a smooth convex function over the unit simplex, written

$$\text{minimize } f(x) \quad \text{subject to } \mathbf{1}^T x \le 1,\; x \ge 0, \qquad (24)$$

in the variable $x \in \mathbb{R}^n$. As discussed in [Juditsky et al., 2009, §3.3], choosing $\|\cdot\|_1$ as the norm and $d(x) = \log n + \sum_{i=1}^{n} x_i \log x_i$ as the prox function, we have $\sigma = 1$ and $d(x^\star) \le \log n$, which means the complexity of solving (24) using Algorithm 1 is bounded by

$$\sqrt{\frac{8 L_1 \log n}{\varepsilon}}, \qquad (25)$$

where $L_1$ is the Lipschitz constant of $\nabla f$ with respect to the $\ell_1$ norm. This choice of norm and prox has a double advantage here. First, the prox term $d(x^\star)$ grows only as $\log n$ with the dimension. Second, the $\ell_\infty$ norm being the smallest among all $\ell_p$ norms, the smoothness bound $L_1$ is also minimal among all choices of $\ell_p$ norms.

Let us now follow the construction of Section 3. The simplex $C = \{x \in \mathbb{R}^n : \mathbf{1}^T x \le 1, x \ge 0\}$ is not centrally symmetric, but we can symmetrize it as the $\ell_1$ ball. The Minkowski norm associated with that set is then equal to the $\ell_1$-norm, so $\|\cdot\|_Q = \|\cdot\|_1$ here. The space $(\mathbb{R}^n, \|\cdot\|_\infty)$ is $(2\log n)$-regular [Juditsky and Nemirovski, 2008, Example 3.2], with the prox function chosen here as $\|\cdot\|_\alpha^2/2$, with $\alpha = 2\log n/(2\log n - 1)$. Proposition 3.9 then shows that the complexity bound we obtain using this procedure is identical to that in (25). A similar result holds in the matrix case.

5.1.1. Strongly Convex Prox. We can generalize this result to all cases where $Q$ is an $\ell_p$ ball. When $p \in [1, 2]$, [Juditsky et al., 2009, Ex. 3.2] shows that the dual norm $\|\cdot\|_q$, with $1/p + 1/q = 1$, is $\Delta_Q$-regular, with

$$\Delta_Q = \min\Big\{\inf_{2 \le \rho \le q}\,(\rho - 1)\, n^{\frac{2}{\rho} - \frac{2}{q}},\; C \log n\Big\}, \quad \text{when } p \in [1, 2],$$

where $C > 0$ is an absolute constant.
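Before treating the range $p \in [2, \infty]$, let us make the simplex example above concrete. With the entropy prox and $\sigma = 1$, the $z$-step (5) of Algorithm 1 has a classical closed form: minimizing $\frac{L}{\sigma} d(x) + \langle c, x\rangle$ over the simplex gives the exponential-weights update $x_i \propto \exp(-\sigma c_i/L)$, where $c$ accumulates the weighted gradients $\sum_t \alpha_t \nabla f(x_t)$. The sketch below is our own, with assumed names:

```python
import numpy as np

def entropy_z_step(c, L, sigma=1.0):
    """Closed-form z-step (5) over the simplex for the entropy prox
    d(x) = log n + sum_i x_i log x_i: returns x_i ~ exp(-sigma c_i / L)."""
    w = -sigma * c / L
    w -= w.max()                # shift for numerical stability
    x = np.exp(w)
    return x / x.sum()

print(entropy_z_step(np.array([0.3, -1.2, 0.5, 0.0]), L=2.0))  # in the simplex
```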

When $p \in [2, \infty]$, the regularity is only controlled by the distortion $d(\|\cdot\|_q, \|\cdot\|_2)$, since $\|\cdot\|_\alpha^2$ is only smooth when $\alpha \ge 2$. This means that $\|\cdot\|_q$ is $\Delta_Q$-regular, with

$$\Delta_Q = n^{1 - \frac{2}{p}}, \quad \text{when } p \in [2, \infty].$$

This means that the complexity of solving

$$\text{minimize } f(x) \quad \text{subject to } x \in B_p \qquad (26)$$

in the variable $x \in \mathbb{R}^n$, where $B_p$ is the $\ell_p$ ball, using Algorithm 1, is bounded by

$$\sqrt{\frac{4 L_p \Delta_Q}{\varepsilon}},$$

where $L_p$ is the Lipschitz constant of $\nabla f$ with respect to the $\ell_p$ norm. We will later see that this bound is nearly optimal when $1 \le p \le 2$; however, the dimension dependence of the bounds when $p > 2$ is essentially suboptimal. In order to obtain optimal methods in this range we will need our $p$-uniformly convex extensions.

5.1.2. Uniformly Convex Bregman Prox. In the case $2 \le p < \infty$, the function $\Psi_p(w) = \frac{1}{p}\|w\|_p^p$ is $p$-uniformly convex w.r.t. $\|\cdot\|_p$ (see, e.g., [Ball et al., 1994]), and thus

$$D_{p, B_p} = 1, \quad \text{when } p \in [2, \infty).$$

As a consequence, Algorithm 2 with $\Psi_p$ as $p$-uniformly convex prox requires

$$T \le C(p)\left(\frac{L_p}{\varepsilon}\right)^{\frac{p}{p+2}}$$

iterations to reach a target precision $\varepsilon$, where $C(p)$ is a constant depending only on $p$ (which nevertheless diverges as $p \to \infty$). This complexity guarantee admits passage to the limit $p \to \infty$ with a poly-logarithmic extra factor. Note however that in this case we can avoid any dimension dependence with the much simpler Frank-Wolfe method.

5.2. Lower Bounds. We show that, in the case of $\ell_p$ balls, the estimates from the proposed methods are nearly optimal in terms of information-based complexity. We consider the class of problems given by the minimization of smooth convex objectives with a bound $L_p$ on the Lipschitz constant of their gradients w.r.t. the norm $\|\cdot\|_p$, with the feasible domain given by the radius-$R$ ball $B_p(R)$. We emphasize that the lower bounds we present only hold for the large-scale regime, where the number of iterations $T$ is upper bounded by the dimension of the space, $n$. It is well-known that when one can afford a super-linear (in dimension) number of iterations, methods such as the center of gravity or the ellipsoid method can achieve better complexity estimates [Nemirovskii and Yudin, 1979].

First, in the range $1 \le p \le 2$, we can immediately use the lower bound on risk from [Guzmán and Nemirovski, 2015],

$$\Omega\left(\frac{L_p R^2}{T^2 \log(T+1)}\right) \qquad (27)$$

where $T$ is the number of iterations, which translates into the following lower bound on iteration complexity

$$\Omega\left(\sqrt{\frac{L_p R^2}{\varepsilon \log n}}\right) \qquad (28)$$

as a function of the target precision $\varepsilon > 0$. Therefore, the affine invariant algorithm is optimal, up to poly-logarithmic factors, in this range.

For the second range, $2 \le p \le \infty$, the lower bound states that the accuracy after $T$ steps is no better than

$$\Omega\left(\frac{L_p R^2}{\min[p, \log n]\; T^{1 + 2/p}}\right),$$

which translates into the iteration complexity lower bound

$$\Omega\left(\left(\frac{L_p R^2}{\min[p, \log n]\,\varepsilon}\right)^{\frac{p}{p+2}}\right).$$

For fixed $2 \le p < \infty$, this lower bound matches, up to constant factors, our iteration complexity obtained for these setups. For the case $p = \infty$, our algorithm also turns out to be optimal, up to factors polylogarithmic in the dimension.

6. NUMERICAL RESULTS

We now briefly illustrate the numerical performance of our methods on a simple problem taken from [Nesterov, 2015]. To test the adaptivity of Algorithm 3, we focus on solving the following continuous Steiner problem,

$$\min_x\; \frac{1}{m}\sum_{i=1}^{m}\|x - x_i\|_2^q, \qquad (29)$$

in the variable $x \in \mathbb{R}^n$, with parameters $x_i \in \mathbb{R}^n$ for $i = 1, \ldots, m$. The parameter $q \in [1, 2]$ controls the Hölder continuity of the objective. We sample the points $x_i$ uniformly at random in the cube $[0, 1]^n$. We set $n = 50$, $m = 10$ and the target precision $\varepsilon = 10^{-2}$. We compare iterates with the optimum obtained using CVX [Grant et al., 2001]. We observe that while the algorithm solving the three cases $q = 1, 1.5, 2$ is identical, it is significantly faster on smoother problems, as forecast by the adaptive bound in Corollary 4.12.

FIGURE 1. We test the adaptivity of Algorithm 3. Left: convergence plot of Algorithm 3 applied to the continuous Steiner problem (29) for $q = 1, 1.5, 2$, showing $f_t - f^\star$ against iterations. Right: value of the local smoothness parameter $M$ across iterations.
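For reference, the experiment's first-order oracle is simple to set up. The sketch below (our own, with assumed names) builds the objective (29) and its gradient, which is Hölder continuous with exponent $q - 1$; it would be handed to an implementation of Algorithm 3 such as the one sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 50, 10, 1.5
pts = rng.uniform(0.0, 1.0, size=(m, n))     # the points x_i of problem (29)

def f(x):
    """Continuous Steiner objective (1/m) sum_i ||x - x_i||_2^q."""
    return np.mean(np.linalg.norm(x - pts, axis=1) ** q)

def grad_f(x):
    """Gradient (q/m) sum_i ||x - x_i||^{q-2} (x - x_i), (q-1)-Hölder."""
    d = x - pts                               # shape (m, n)
    r = np.maximum(np.linalg.norm(d, axis=1), 1e-12)   # avoid divide-by-zero
    return (q / m) * ((r ** (q - 2))[:, None] * d).sum(axis=0)
```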

7. CONCLUSION

From a practical point of view, the results above offer guidance in the choice of a prox function depending on the geometry of the feasible set $Q$. On the theoretical side, these results provide affine invariant descriptions of the complexity of an optimization problem based on both the geometry of the feasible set and the smoothness of the objective function. In our first algorithm, this complexity bound is written in terms of the regularity constant of the polar of the feasible set and the Lipschitz constant of $\nabla f$ with respect to the Minkowski norm. In our last two methods, the regularity constant is replaced by a Bregman diameter constructed from an optimal choice of prox. When $Q$ is an $\ell_p$ ball, matching lower bounds on iteration complexity for the algorithm in [Nesterov, 1983] show that these bounds are optimal in terms of target precision, smoothness and problem dimension, up to a polylogarithmic term.

However, while we show that it is possible to formulate an affine invariant implementation of the optimal algorithm in [Nesterov, 1983], we do not yet show that this is always a good idea outside of the $\ell_p$ case... In particular, given our choice of norm, the constants $L_Q$ and $\Delta_Q$ are both affine invariant, with $\Delta_Q$ optimal by our choice of prox function, minimizing the regularity over all smooth squared norms. However, outside of the cases where $Q$ is an $\ell_p$ ball, this does not mean that our choice of norm (the Minkowski gauge of a centrally symmetric feasible set) minimizes the product $L_Q \min\{\Delta_Q, n\}$, hence that we achieve the best possible bound for the complexity of the smooth algorithm in [Nesterov, 1983] and its derivatives. Furthermore, while our bounds give clear indications of what an optimal choice of prox should look like, given a choice of norm, this characterization is not constructive outside of special cases like $\ell_p$-balls.

ACKNOWLEDGMENTS

AA and MJ would like to acknowledge support from the European Research Council (project SIPA). MJ also acknowledges support from the Swiss National Science Foundation (SNSF). The authors would also like to acknowledge support from the chaire Économie des nouvelles données, the data science joint research initiative with the fonds AXA pour la recherche, and a gift from Société Générale Cross Asset Quantitative Research.

REFERENCES

Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint, 2014.
K. Ball, E.A. Carlen, and E.H. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115(1):463–482, 1994.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. CORE Discussion Papers (2011/02), 2011.
M. Grant, S. Boyd, and Y. Ye. CVX: Matlab software for disciplined convex programming, 2001.
Cristóbal Guzmán and Arkadi Nemirovski. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1–14, 2015.
Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Convex Analysis and Minimization Algorithms. Springer, 1993.
A. Juditsky and A.S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv preprint, 2008.
A. Juditsky, G. Lan, A. Nemirovski, and A. Shapiro. Stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 2009.
A. Nemirovskii and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Nauka, 1979 (published in English by John Wiley, Chichester, 1983).

A.S. Nemirovskii and Yu.E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics, 25(2):21–30, 1985.
Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2003.
Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, Philadelphia, 1994.
Yu. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
Gilles Pisier. Martingales with values in uniformly convex spaces. Israel Journal of Mathematics, 20(3-4):326–350, 1975.
Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems 24, 2011.

CNRS & D.I., UMR 8548, ÉCOLE NORMALE SUPÉRIEURE, PARIS, FRANCE.
E-mail address: aspremon@ens.fr

FACULTAD DE MATEMÁTICAS & ESCUELA DE INGENIERÍA, PONTIFICIA UNIVERSIDAD CATÓLICA DE CHILE.
E-mail address: crguzman@uc.cl

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE, SWITZERLAND.
E-mail address: m.jaggi@gmail.com


More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

Understanding DPMFoam/MPPICFoam

Understanding DPMFoam/MPPICFoam Understanding DPMFoam/MPPICFoam Jeroen Hofman March 18, 2015 In this document I intend to clarify the flow solver and at a later stage, the article-fluid and article-article interaction forces as imlemented

More information

On split sample and randomized confidence intervals for binomial proportions

On split sample and randomized confidence intervals for binomial proportions On slit samle and randomized confidence intervals for binomial roortions Måns Thulin Deartment of Mathematics, Usala University arxiv:1402.6536v1 [stat.me] 26 Feb 2014 Abstract Slit samle methods have

More information

BEST CONSTANT IN POINCARÉ INEQUALITIES WITH TRACES: A FREE DISCONTINUITY APPROACH

BEST CONSTANT IN POINCARÉ INEQUALITIES WITH TRACES: A FREE DISCONTINUITY APPROACH BEST CONSTANT IN POINCARÉ INEQUALITIES WITH TRACES: A FREE DISCONTINUITY APPROACH DORIN BUCUR, ALESSANDRO GIACOMINI, AND PAOLA TREBESCHI Abstract For Ω R N oen bounded and with a Lischitz boundary, and

More information

1 Extremum Estimators

1 Extremum Estimators FINC 9311-21 Financial Econometrics Handout Jialin Yu 1 Extremum Estimators Let θ 0 be a vector of k 1 unknown arameters. Extremum estimators: estimators obtained by maximizing or minimizing some objective

More information

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS #A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS Ramy F. Taki ElDin Physics and Engineering Mathematics Deartment, Faculty of Engineering, Ain Shams University, Cairo, Egyt

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

Robust Solutions to Markov Decision Problems

Robust Solutions to Markov Decision Problems Robust Solutions to Markov Decision Problems Arnab Nilim and Laurent El Ghaoui Deartment of Electrical Engineering and Comuter Sciences University of California, Berkeley, CA 94720 nilim@eecs.berkeley.edu,

More information

Recursive Estimation of the Preisach Density function for a Smart Actuator

Recursive Estimation of the Preisach Density function for a Smart Actuator Recursive Estimation of the Preisach Density function for a Smart Actuator Ram V. Iyer Deartment of Mathematics and Statistics, Texas Tech University, Lubbock, TX 7949-142. ABSTRACT The Preisach oerator

More information

Dependence on Initial Conditions of Attainable Sets of Control Systems with p-integrable Controls

Dependence on Initial Conditions of Attainable Sets of Control Systems with p-integrable Controls Nonlinear Analysis: Modelling and Control, 2007, Vol. 12, No. 3, 293 306 Deendence on Initial Conditions o Attainable Sets o Control Systems with -Integrable Controls E. Akyar Anadolu University, Deartment

More information

On a class of Rellich inequalities

On a class of Rellich inequalities On a class of Rellich inequalities G. Barbatis A. Tertikas Dedicated to Professor E.B. Davies on the occasion of his 60th birthday Abstract We rove Rellich and imroved Rellich inequalities that involve

More information

Multiplicity of weak solutions for a class of nonuniformly elliptic equations of p-laplacian type

Multiplicity of weak solutions for a class of nonuniformly elliptic equations of p-laplacian type Nonlinear Analysis 7 29 536 546 www.elsevier.com/locate/na Multilicity of weak solutions for a class of nonuniformly ellitic equations of -Lalacian tye Hoang Quoc Toan, Quô c-anh Ngô Deartment of Mathematics,

More information

THE 2D CASE OF THE BOURGAIN-DEMETER-GUTH ARGUMENT

THE 2D CASE OF THE BOURGAIN-DEMETER-GUTH ARGUMENT THE 2D CASE OF THE BOURGAIN-DEMETER-GUTH ARGUMENT ZANE LI Let e(z) := e 2πiz and for g : [0, ] C and J [0, ], define the extension oerator E J g(x) := g(t)e(tx + t 2 x 2 ) dt. J For a ositive weight ν

More information

Applications to stochastic PDE

Applications to stochastic PDE 15 Alications to stochastic PE In this final lecture we resent some alications of the theory develoed in this course to stochastic artial differential equations. We concentrate on two secific examles:

More information

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003 SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES Gord Sinnamon The University of Western Ontario December 27, 23 Abstract. Strong forms of Schur s Lemma and its converse are roved for mas

More information

Stochastic integration II: the Itô integral

Stochastic integration II: the Itô integral 13 Stochastic integration II: the Itô integral We have seen in Lecture 6 how to integrate functions Φ : (, ) L (H, E) with resect to an H-cylindrical Brownian motion W H. In this lecture we address the

More information

Boundary problems for fractional Laplacians and other mu-transmission operators

Boundary problems for fractional Laplacians and other mu-transmission operators Boundary roblems for fractional Lalacians and other mu-transmission oerators Gerd Grubb Coenhagen University Geometry and Analysis Seminar June 20, 2014 Introduction Consider P a equal to ( ) a or to A

More information

ON PRINCIPAL FREQUENCIES AND INRADIUS IN CONVEX SETS

ON PRINCIPAL FREQUENCIES AND INRADIUS IN CONVEX SETS ON PRINCIPAL FREQUENCIES AND INRADIUS IN CONVEX SES LORENZO BRASCO o Michelino Brasco, master craftsman and father, on the occasion of his 7th birthday Abstract We generalize to the case of the Lalacian

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

Solving Support Vector Machines in Reproducing Kernel Banach Spaces with Positive Definite Functions

Solving Support Vector Machines in Reproducing Kernel Banach Spaces with Positive Definite Functions Solving Suort Vector Machines in Reroducing Kernel Banach Saces with Positive Definite Functions Gregory E. Fasshauer a, Fred J. Hickernell a, Qi Ye b, a Deartment of Alied Mathematics, Illinois Institute

More information

A construction of bent functions from plateaued functions

A construction of bent functions from plateaued functions A construction of bent functions from lateaued functions Ayça Çeşmelioğlu, Wilfried Meidl Sabancı University, MDBF, Orhanlı, 34956 Tuzla, İstanbul, Turkey. Abstract In this resentation, a technique for

More information

Additive results for the generalized Drazin inverse in a Banach algebra

Additive results for the generalized Drazin inverse in a Banach algebra Additive results for the generalized Drazin inverse in a Banach algebra Dragana S. Cvetković-Ilić Dragan S. Djordjević and Yimin Wei* Abstract In this aer we investigate additive roerties of the generalized

More information

On the minimax inequality and its application to existence of three solutions for elliptic equations with Dirichlet boundary condition

On the minimax inequality and its application to existence of three solutions for elliptic equations with Dirichlet boundary condition ISSN 1 746-7233 England UK World Journal of Modelling and Simulation Vol. 3 (2007) No. 2. 83-89 On the minimax inequality and its alication to existence of three solutions for ellitic equations with Dirichlet

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

Multi-Operation Multi-Machine Scheduling

Multi-Operation Multi-Machine Scheduling Multi-Oeration Multi-Machine Scheduling Weizhen Mao he College of William and Mary, Williamsburg VA 3185, USA Abstract. In the multi-oeration scheduling that arises in industrial engineering, each job

More information

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)]

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)] LECTURE 7 NOTES 1. Convergence of random variables. Before delving into the large samle roerties of the MLE, we review some concets from large samle theory. 1. Convergence in robability: x n x if, for

More information

Preconditioning techniques for Newton s method for the incompressible Navier Stokes equations

Preconditioning techniques for Newton s method for the incompressible Navier Stokes equations Preconditioning techniques for Newton s method for the incomressible Navier Stokes equations H. C. ELMAN 1, D. LOGHIN 2 and A. J. WATHEN 3 1 Deartment of Comuter Science, University of Maryland, College

More information

t 0 Xt sup X t p c p inf t 0

t 0 Xt sup X t p c p inf t 0 SHARP MAXIMAL L -ESTIMATES FOR MARTINGALES RODRIGO BAÑUELOS AND ADAM OSȨKOWSKI ABSTRACT. Let X be a suermartingale starting from 0 which has only nonnegative jums. For each 0 < < we determine the best

More information

HAUSDORFF MEASURE OF p-cantor SETS

HAUSDORFF MEASURE OF p-cantor SETS Real Analysis Exchange Vol. 302), 2004/2005,. 20 C. Cabrelli, U. Molter, Deartamento de Matemática, Facultad de Cs. Exactas y Naturales, Universidad de Buenos Aires and CONICET, Pabellón I - Ciudad Universitaria,

More information

A Note on Guaranteed Sparse Recovery via l 1 -Minimization

A Note on Guaranteed Sparse Recovery via l 1 -Minimization A Note on Guaranteed Sarse Recovery via l -Minimization Simon Foucart, Université Pierre et Marie Curie Abstract It is roved that every s-sarse vector x C N can be recovered from the measurement vector

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

MATH 6210: SOLUTIONS TO PROBLEM SET #3

MATH 6210: SOLUTIONS TO PROBLEM SET #3 MATH 6210: SOLUTIONS TO PROBLEM SET #3 Rudin, Chater 4, Problem #3. The sace L (T) is searable since the trigonometric olynomials with comlex coefficients whose real and imaginary arts are rational form

More information

Principal Components Analysis and Unsupervised Hebbian Learning

Principal Components Analysis and Unsupervised Hebbian Learning Princial Comonents Analysis and Unsuervised Hebbian Learning Robert Jacobs Deartment of Brain & Cognitive Sciences University of Rochester Rochester, NY 1467, USA August 8, 008 Reference: Much of the material

More information

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables Imroved Bounds on Bell Numbers and on Moments of Sums of Random Variables Daniel Berend Tamir Tassa Abstract We rovide bounds for moments of sums of sequences of indeendent random variables. Concentrating

More information

Bent Functions of maximal degree

Bent Functions of maximal degree IEEE TRANSACTIONS ON INFORMATION THEORY 1 Bent Functions of maximal degree Ayça Çeşmelioğlu and Wilfried Meidl Abstract In this article a technique for constructing -ary bent functions from lateaued functions

More information

REGULARITY OF SOLUTIONS TO DOUBLY NONLINEAR DIFFUSION EQUATIONS

REGULARITY OF SOLUTIONS TO DOUBLY NONLINEAR DIFFUSION EQUATIONS Seventh Mississii State - UAB Conference on Differential Equations and Comutational Simulations, Electronic Journal of Differential Equations, Conf. 17 2009,. 185 195. ISSN: 1072-6691. URL: htt://ejde.math.txstate.edu

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

2 K. ENTACHER 2 Generalized Haar function systems In the following we x an arbitrary integer base b 2. For the notations and denitions of generalized

2 K. ENTACHER 2 Generalized Haar function systems In the following we x an arbitrary integer base b 2. For the notations and denitions of generalized BIT 38 :2 (998), 283{292. QUASI-MONTE CARLO METHODS FOR NUMERICAL INTEGRATION OF MULTIVARIATE HAAR SERIES II KARL ENTACHER y Deartment of Mathematics, University of Salzburg, Hellbrunnerstr. 34 A-52 Salzburg,

More information

Difference of Convex Functions Programming for Reinforcement Learning (Supplementary File)

Difference of Convex Functions Programming for Reinforcement Learning (Supplementary File) Difference of Convex Functions Programming for Reinforcement Learning Sulementary File Bilal Piot 1,, Matthieu Geist 1, Olivier Pietquin,3 1 MaLIS research grou SUPELEC - UMI 958 GeorgiaTech-CRS, France

More information

On Doob s Maximal Inequality for Brownian Motion

On Doob s Maximal Inequality for Brownian Motion Stochastic Process. Al. Vol. 69, No., 997, (-5) Research Reort No. 337, 995, Det. Theoret. Statist. Aarhus On Doob s Maximal Inequality for Brownian Motion S. E. GRAVERSEN and G. PESKIR If B = (B t ) t

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

Topic: Lower Bounds on Randomized Algorithms Date: September 22, 2004 Scribe: Srinath Sridhar

Topic: Lower Bounds on Randomized Algorithms Date: September 22, 2004 Scribe: Srinath Sridhar 15-859(M): Randomized Algorithms Lecturer: Anuam Guta Toic: Lower Bounds on Randomized Algorithms Date: Setember 22, 2004 Scribe: Srinath Sridhar 4.1 Introduction In this lecture, we will first consider

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

Asymptotic Properties of the Markov Chain Model method of finding Markov chains Generators of..

Asymptotic Properties of the Markov Chain Model method of finding Markov chains Generators of.. IOSR Journal of Mathematics (IOSR-JM) e-issn: 78-578, -ISSN: 319-765X. Volume 1, Issue 4 Ver. III (Jul. - Aug.016), PP 53-60 www.iosrournals.org Asymtotic Proerties of the Markov Chain Model method of

More information

Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Application

Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Application BULGARIA ACADEMY OF SCIECES CYBEREICS AD IFORMAIO ECHOLOGIES Volume 9 o 3 Sofia 009 Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Alication Svetoslav Savov Institute of Information

More information

An Algorithm for Total Variation Minimization and Applications

An Algorithm for Total Variation Minimization and Applications Journal of Mathematical Imaging and Vision 0: 89 97, 004 c 004 Kluwer Academic Publishers. Manufactured in The Netherlands. An Algorithm for Total Variation Minimization and Alications ANTONIN CHAMBOLLE

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

Extension of Minimax to Infinite Matrices

Extension of Minimax to Infinite Matrices Extension of Minimax to Infinite Matrices Chris Calabro June 21, 2004 Abstract Von Neumann s minimax theorem is tyically alied to a finite ayoff matrix A R m n. Here we show that (i) if m, n are both inite,

More information

On the Chvatál-Complexity of Knapsack Problems

On the Chvatál-Complexity of Knapsack Problems R u t c o r Research R e o r t On the Chvatál-Comlexity of Knasack Problems Gergely Kovács a Béla Vizvári b RRR 5-08, October 008 RUTCOR Rutgers Center for Oerations Research Rutgers University 640 Bartholomew

More information

LEIBNIZ SEMINORMS IN PROBABILITY SPACES

LEIBNIZ SEMINORMS IN PROBABILITY SPACES LEIBNIZ SEMINORMS IN PROBABILITY SPACES ÁDÁM BESENYEI AND ZOLTÁN LÉKA Abstract. In this aer we study the (strong) Leibniz roerty of centered moments of bounded random variables. We shall answer a question

More information