Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent
Haihao Lu

August 3, 2018

Abstract

The usual approach to developing and analyzing first-order methods for non-smooth (stochastic or deterministic) convex optimization assumes that the objective function is uniformly Lipschitz continuous with parameter M_f. However, in many settings the non-differentiable convex function f is not uniformly Lipschitz continuous, for example: (i) the classical support vector machine (SVM) problem, (ii) the problem of minimizing the maximum of convex quadratic functions, and even (iii) the univariate setting with f(x) := max{0, x} + x². Herein we develop a notion of relative continuity that is determined relative to a user-specified reference function h (which should be computationally tractable for algorithms), and we show that many non-differentiable convex functions are relatively continuous with respect to a correspondingly fairly simple reference function h. We also similarly develop a notion of relative stochastic continuity for the stochastic setting. We analyze two standard algorithms, the (deterministic) mirror descent algorithm and the stochastic mirror descent algorithm, for solving optimization problems in these new settings, providing the first computational guarantees for instances where the objective function is not uniformly Lipschitz continuous. This paper is a companion paper for non-differentiable convex optimization to the recent paper by Lu, Freund, and Nesterov, which developed analogous results for differentiable convex optimization.

1 Introduction

The usual approach to developing and analyzing first-order methods for non-differentiable convex optimization (which we review shortly) assumes that the objective function is uniformly Lipschitz continuous, in both deterministic and stochastic settings. However, in many settings the objective function f is not uniformly Lipschitz continuous.
For example, consider the Support Vector Machine problem (SVM) for binary classification in machine learning, whose optimization formulation is:

SVM: min_x f(x) := (1/n)·Σ_{i=1}^n max{0, 1 − y_i·xᵀw_i} + λ·‖x‖₂²,

(MIT Department of Mathematics and MIT Operations Research Center, 77 Massachusetts Avenue, Cambridge, MA 02139 (haihao@mit.edu). The author's research is supported by AFOSR Grant No. FA and the MIT-Belgium Université Catholique de Louvain Fund.)
where w_i is the input feature vector of sample i and y_i ∈ {−1, 1} is the label of sample i. Notice that f is not differentiable due to the presence of the hinge loss terms in the summation, and f is also not Lipschitz continuous due to the presence of the squared ℓ2-norm regularization term; thus we cannot directly utilize typical subgradient or gradient schemes and their associated computational guarantees for SVM. Another example is the problem of computing a point x ∈ R^m in the intersection of n ellipsoids, which can be tackled via the optimization problem

IEP: f* = min_x f(x) := max_{1≤i≤n} { ½·xᵀA_ix + b_iᵀx + c_i },

where the i-th ellipsoid is Q_i = {x ∈ R^m : ½·xᵀA_ix + b_iᵀx + c_i ≤ 0} and A_i ∈ R^{m×m} is a symmetric positive semi-definite matrix, i = 1, ..., n. Observe that the objective function f is both non-differentiable and non-Lipschitz, and so it falls outside of the scope of standard classes of optimization problems for which first-order methods are guaranteed to work. Nevertheless, using the machinery developed in this paper, we will show in Section 5.3 how to solve both of these problems using deterministic or stochastic Mirror Descent.

In this paper we develop a general theory and algorithmic constructs that overcome the drawbacks in the usual analyses of first-order methods, which are grounded in restricted notions of uniform Lipschitz continuity. Here we develop a notion of relative continuity with respect to a given convex reference function h, a notion which does not require the specification of any particular norm; indeed, h need not be strongly (or even strictly) convex. Armed with relative continuity, we demonstrate the capability to solve a more general class of non-differentiable convex optimization problems (without uniform Lipschitz continuity) in both deterministic and stochastic settings. This paper is a companion for non-differentiable convex optimization to our predecessor paper [8] for differentiable convex optimization.
In [8], with a very similar philosophy, we developed the notions of relative smoothness and relative strong convexity with respect to a given convex reference function. In that paper we demonstrated the capability to solve a more general class of differentiable convex optimization problems (without uniformly Lipschitz continuous gradients), and we also demonstrated linear convergence results for a Primal Gradient Scheme when the objective function f is both smooth and strongly convex. There are some concurrent works on smooth optimization sharing a similar spirit to [8]. Bauschke, Bolte, and Teboulle [] present a similar definition of relative smoothness as in [8] and analyze the convergence of the Mirror Descent Algorithm, although their algorithm and convergence complexity depend on a symmetry measure of the Bregman distance. Zhou et al. [7] discuss a unified proof of Mirror Descent and the Proximal Point Algorithm under a similar assumption of relative smoothness. Nguyen [5] develops similar ideas for analyzing Mirror Descent in a Banach space. A more detailed discussion comparing these related works is also presented in [8]. More recently, Hanzely and Richtarik [6] develop stochastic algorithms for the relatively smooth optimization setting.

In Section 2 we review the traditional set-up for Mirror Descent in both the deterministic and stochastic settings. In Section 3 we introduce our notion of relative continuity in both the deterministic and stochastic settings, together with some relevant properties. In Section 4 we prove computational guarantees associated with the Mirror Descent and Stochastic Mirror Descent
algorithms under relative continuity. In Section 5 we show constructively how our ideas apply to a large class of non-differentiable and non-Lipschitz convex optimization problems that are not otherwise solvable by traditional first-order methods. Also in Section 5, we analyze computational guarantees associated with Mirror Descent and Stochastic Mirror Descent for the Intersection of Ellipsoids Problem (IEP) and also the Support Vector Machine (SVM) problem.

Notation. ‖·‖ denotes a given norm on R^n and ‖·‖_* denotes the usual dual norm on the dual space. ‖x‖₂ := √(xᵀx) denotes the Euclidean (inner product) norm, where xᵀ denotes the transpose of the vector x, and B_{‖·‖}(c, r) := {x ∈ R^n : ‖x − c‖ ≤ r}. ‖A‖₂ denotes the ℓ2 (spectral) norm of a matrix A. The inner product ⟨·,·⟩ specifically denotes the dot inner product in the underlying vector space. For a conditional random variable s̃(x) given x, E[s̃(x)|x] denotes the conditional expectation of s̃(x) given x.

2 Traditional Mirror Descent

The optimization problem of interest is:

P: min_x f(x) s.t. x ∈ Q, (1)

where Q ⊆ R^n is a closed convex set and f : Q → R is a convex function that is not necessarily differentiable. There are very many deterministic and stochastic first-order methods for tackling (1); see for example [4], [], [9] and the references therein. Virtually all such methods are designed to solve (1) when the objective function f satisfies a uniform Lipschitz continuity condition on Q, which in the deterministic setting is (essentially) equivalent to the condition that there exists a constant M_f < ∞ for which:

‖g(x)‖_* ≤ M_f for all x ∈ Q and g(x) ∈ ∂f(x), (2)

where ∂f(x) is the subdifferential of f at x (i.e., the collection of subgradients of f at x), ‖·‖ is a given norm on R^n, and ‖·‖_* denotes the usual dual norm. Here we use g(x) to denote an assignment of a subgradient (or an oracle call thereof) at x, and so g(x) is not a function nor is it a point-to-set map.
Another useful functional notion is strong convexity: f is (uniformly) µ_f-strongly convex for some µ_f > 0 if

f(y) ≥ f(x) + ⟨g(x), y − x⟩ + (µ_f/2)·‖y − x‖² for all x, y ∈ Q and g(x) ∈ ∂f(x). (3)

Nedić and Lee [9] obtain improved convergence guarantees for the stochastic mirror descent algorithm under strong convexity, for example.

2.1 Deterministic Setting

Let us now recall the Mirror Descent Algorithm (see [0] and []), which is also referred to as the prox subgradient method when interpreted in the space of primal variables. Mirror Descent employs a
differentiable convex prox function h to define a Bregman distance:

D_h(y, x) := h(y) − h(x) − ⟨∇h(x), y − x⟩ for all x, y ∈ Q. (4)

The Bregman distance is used in the computation of the Mirror Descent update:

x^{i+1} ← arg min_{x∈Q} { f(x^i) + ⟨g(x^i), x − x^i⟩ + (1/t_i)·D_h(x, x^i) },

where {t_i} is the sequence of step-sizes for the scheme. A formal statement of the Mirror Descent Algorithm is presented in Algorithm 1. The traditional set-up requires that h is 1-strongly convex with respect to the given norm, and in this set-up one can prove that after k iterations it holds for any x ∈ Q that:

min_{0≤i≤k} f(x^i) − f(x) ≤ ( (M_f²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ) / ( Σ_{i=0}^k t_i ), (5)

which leads to an O(1/√k) sublinear rate of convergence using an appropriately chosen step-size sequence {t_i}, see [].

Algorithm 1: Deterministic Mirror Descent Algorithm with Bregman distance D_h(·,·)

Initialize. Initialize with x⁰ ∈ Q. Let h be a given convex function.

At iteration i: Perform Updates. Compute g(x^i) ∈ ∂f(x^i), determine step-size t_i, and compute the update:

x^{i+1} ← arg min_{x∈Q} { f(x^i) + ⟨g(x^i), x − x^i⟩ + (1/t_i)·D_h(x, x^i) }. (6)

Notice in (6) by construction that the update requires the capability to solve instances of a linearized subproblem (which we denote by LS) of the general form:

LS: x_new ← arg min_{x∈Q} { ⟨c, x⟩ + h(x) }, (7)

for suitable iteration-specific values of c. Indeed, (6) is an instance of (7) with c = t_i·g(x^i) − ∇h(x^i) at iteration i. It is especially important to note that the Mirror Descent update (6) is somewhat meaningless absent the capability to efficiently solve (7), a fact which we return to later. In the usual design and implementation of Mirror Descent for solving (1), one attempts to specify the norm ‖·‖ and the 1-strongly convex prox function h in consideration of properties of the feasible domain Q, while also ensuring that the LS subproblem (7) is efficiently solvable.
Notice that the Mirror Descent Algorithm (Algorithm 1) itself does not require the traditional set-up in which h is 1-strongly convex for some particular norm; rather, this requirement is part of the traditional analysis. As we will see, we can instead analyze Mirror Descent by considering the intrinsic ways that f and D_h(·,·) are related functionally, in a manner that is constructive in terms of actual algorithm design and implementation. Furthermore, this is in the same spirit as the predecessor paper [8].
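To make the Mirror Descent update concrete, the following minimal sketch (our own illustration in Python, not code from the paper) instantiates Algorithm 1 in the unconstrained Euclidean case h(x) = ½‖x‖₂² and Q = R^n, where D_h(y, x) = ½‖y − x‖₂² and the update (6) reduces to a plain subgradient step x^{i+1} = x^i − t_i·g(x^i):

```python
import numpy as np

def mirror_descent_euclidean(subgrad, x0, steps):
    """Algorithm 1 with h(x) = (1/2)||x||^2 and Q = R^n, so that the
    update (6) reduces to a plain subgradient step x <- x - t * g(x)."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for t in steps:
        g = subgrad(x)     # any subgradient g(x) of f at x
        x = x - t * g      # minimizer of <g, x> + (1/t) * (1/2)||x - x_i||^2
        iterates.append(x.copy())
    return iterates

# Example: f(x) = |x|, subgradient sign(x), started at x0 = 1.0.
iters = mirror_descent_euclidean(np.sign, np.array([1.0]), [0.1] * 50)
f_best = min(abs(float(z[0])) for z in iters)
```

Since this h is 1-strongly convex in the Euclidean norm, the classical bound (5) applies to this example with a constant step-size.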
2.2 Stochastic Setting

For some convex functions, computing an exact subgradient at x ∈ Q may be expensive or even intractable, but sampling a random stochastic estimate of a subgradient at x, which we denote by g̃(x), may be easy. We say that g̃(x) is an unbiased stochastic subgradient if E[g̃(x)|x] ∈ ∂f(x). The usefulness of a stochastic subgradient methodology is easily seen in the context of machine and statistical learning problems. A prototypical learning problem is to compute an approximate solution of the following empirical loss minimization problem:

min_x f(x) := (1/n)·Σ_{j=1}^n f_j(x) s.t. x ∈ Q,

where f_j is a non-differentiable convex loss function associated with sample j, for j = 1, ..., n data samples. When n ≫ 0, the standard subgradient method needs to evaluate n subgradients in order to compute one subgradient of f, which can be prohibitively expensive. A typical alternative is to compute a stochastic subgradient. Letting x^i denote the i-th iterate, at iteration i a single sample index j is drawn uniformly and independently from {1, ..., n}, and then a subgradient g ∈ ∂f_j(x^i) is computed and used to define g̃(x^i) := g. This stochastic subgradient is then used in place of a subgradient at iteration i. Notice that by construction g̃(x^i) is a conditional random variable given x^i, and g̃(x^i) is an unbiased stochastic subgradient, namely E[g̃(x^i)|x^i] ∈ ∂f(x^i). A stochastic version of Mirror Descent is presented in Algorithm 2. The structure of Stochastic Mirror Descent is identical to that of Mirror Descent, the only difference being that the stochastic estimate g̃(x^i) replaces the exact subgradient g(x^i) of Algorithm 1.

Algorithm 2: Stochastic Mirror Descent Algorithm with Bregman distance D_h(·,·)

Initialize. Initialize with x⁰ ∈ Q. Let h be a given convex differentiable function.

At iteration i: Perform Updates.
Compute a stochastic subgradient g̃(x^i), determine step-size t_i, and compute the update:

x^{i+1} ← arg min_{x∈Q} { f(x^i) + ⟨g̃(x^i), x − x^i⟩ + (1/t_i)·D_h(x, x^i) }. (8)

A standard condition that is required in the traditional convergence analysis for Stochastic Mirror Descent (as well as other stochastic first-order methods) is that there exists G_f > 0 for which:

E[ ‖g̃(x)‖_*² | x ] ≤ G_f², for any x ∈ Q. (9)

For notational convenience, we say that f is G_f-stochastically continuous if (9) holds. In [9], Nedić and Lee developed convergence results for Stochastic Mirror Descent (Algorithm 2). Under the conditions that (i) f is G_f-stochastically continuous, (ii) h is a differentiable and µ_h-strongly convex function on Q, and (iii) Q is a closed and bounded set, Nedić and Lee ([9] equation (7)) show the following convergence result using step-sizes t_i = √(µ_h·D_max)/(G_f·√(i+1)):

E[ f(x̄^k) ] − f* ≤ (3/2)·G_f·√( D_max / (µ_h·(k+1)) ), (10)
where x̄^k := (1/(Σ_{i=0}^k t_i))·Σ_{i=0}^k t_i·x^i and D_max := max_{x,y∈Q} D_h(x, y). Furthermore, if also (a) f is µ_f-strongly convex, and (b) h is L_h-smooth, Nedić and Lee ([9]) show that with step-sizes t_i = 2·L_h/(µ_f·(i+1)) it holds that:

E[ f(x̌^k) ] − f* ≤ 2·G_f²·L_h / (µ_f·µ_h·(k+1)), (11)

where x̌^k := (2/((k+1)(k+2)))·Σ_{i=0}^k (i+1)·x^i.

3 Relative Continuity

In this section we introduce our definition of relative continuity of a function f; there are actually two different definitions, one for the deterministic setting and another for the stochastic setting. The starting point is a reference function h, a given differentiable convex function on Q that is used to construct the usual Bregman distance D_h(·,·) (4), and that is used as part of the Mirror Descent update (6). However, we point out for emphasis that, unlike the traditional set-up, there are no assumptions on h (such as strong or strict convexity).

3.1 Deterministic Setting

Consider the objective function f of (1). We define relative continuity of f relative to the reference function h using the Bregman distance D_h(·,·) of h as follows.

Definition 3.1. f is M-relative continuous with respect to the reference function h on Q if for any x, y ∈ Q, x ≠ y, and g(x) ∈ ∂f(x), it holds that

‖g(x)‖_* ≤ M·√(2·D_h(y, x)) / ‖y − x‖. (12)

(In the particular case when h(x) = ½‖x‖₂², the Bregman distance is D_h(y, x) = ½‖y − x‖₂², and the relative continuity condition (12) becomes ‖g(x)‖₂ ≤ M, which corresponds to the standard definition of Lipschitz continuity (2) for the ℓ2-norm.) We can rewrite (12) as

‖g(x)‖_*² ≤ M²·( 2·D_h(y, x) / ‖y − x‖² ), (13)

which states that the square of the norm of any subgradient is bounded by M² times the ratio of the Bregman distance D_h(y, x) to ½‖y − x‖². The following proposition presents the key property of an M-relative continuous function that is used in the proofs of the results to follow.
Proposition 3.1. (Key property of M-relative continuity) If f is M-relative continuous with respect to the reference function h, then for any t > 0 it holds for all x, y ∈ Q and g(x) ∈ ∂f(x) that:

(1/t)·D_h(y, x) + ⟨g(x), y − x⟩ + (t/2)·M² ≥ 0. (14)

Proof: If f is M-relative continuous with respect to h, then for any t > 0 it follows that

⟨g(x), x − y⟩ ≤ ‖g(x)‖_*·‖y − x‖ ≤ M·√(2·D_h(y, x)) ≤ (t/2)·M² + (1/t)·D_h(y, x),

where the last inequality is an application of the arithmetic-geometric mean inequality. The proof follows by rearranging terms.

The key property (14) is what is used in the proofs of the results herein, so we could define M-relative continuity using (14) instead of (12). Furthermore, (14) is independent of any norm structure, and so is attractive for its generality. However, we use the definition (12) because it leads to easy verification of M-relative continuity in practical instances, as is shown in Section 5. The following proposition presents some scaling and additivity properties of relative continuity.

Proposition 3.2. (Additivity and scaling of relative continuity)

1. If f is M-relative continuous with respect to h, then for any α > 0, f is (M/√α)-relative continuous with respect to α·h.

2. If f is M-relative continuous with respect to h, then for any α > 0, α·f is M-relative continuous with respect to α²·h.

3. If f_j is M-relative continuous with respect to h_j for j = 1, ..., n, then Σ_{j=1}^n f_j is (√n·M)-relative continuous with respect to Σ_{j=1}^n h_j.

4. If f_j is M_j-relative continuous with respect to h_j for j = 1, ..., n, then for any α_j > 0 and M > 0 it holds that Σ_{j=1}^n α_j·f_j is (√n·M)-relative continuous with respect to Σ_{j=1}^n (α_j²/β_j²)·h_j, with β_j := M/M_j.

Proof:

1. Let x, y ∈ Q, x ≠ y, and g(x) ∈ ∂f(x). It holds that

‖g(x)‖_* ≤ M·√(2·D_h(y, x))/‖y − x‖ = (M/√α)·√(2·D_{α·h}(y, x))/‖y − x‖,

which establishes the result.

2. Notice that g(x) is a subgradient of f at x if and only if α·g(x) is a subgradient of α·f at x, whereby

‖α·g(x)‖_* = α·‖g(x)‖_* ≤ α·M·√(2·D_h(y, x))/‖y − x‖ = M·√(2·D_{α²·h}(y, x))/‖y − x‖,

which establishes the result.

3.
Any subgradient of Σ_{j=1}^n f_j at x can be written as Σ_{j=1}^n g_j(x), where g_j(x) ∈ ∂f_j(x) for j = 1, ..., n (see Theorem B. of [3]). From the triangle inequality and the relative continuity of f_j we have:

‖Σ_{j=1}^n g_j(x)‖_* ≤ Σ_{j=1}^n ‖g_j(x)‖_* ≤ M·Σ_{j=1}^n √(2·D_{h_j}(y, x))/‖y − x‖ ≤ √n·M·√(2·Σ_{j=1}^n D_{h_j}(y, x))/‖y − x‖ = √n·M·√(2·D_{h_1+···+h_n}(y, x))/‖y − x‖,

where the third inequality is an application of the ℓ1/ℓ2-norm inequality applied to the n-tuple (√(D_{h_1}(y, x)), ..., √(D_{h_n}(y, x))).

4. It follows from part (2.) that α_j·f_j is M_j-continuous relative to α_j²·h_j. Thus α_j·f_j is also M-continuous relative to (α_j²/β_j²)·h_j from part (1.), whereby the proof is finished by utilizing part (3.).

We also make use of the notion of relative strong convexity, which was introduced in [8] and is used here in some of the convergence guarantee analyses.

Definition 3.2. f is µ-strongly convex relative to h on Q for a scalar µ ≥ 0 if for any x, y ∈ int Q and any g(x) ∈ ∂f(x) it holds that

f(y) ≥ f(x) + ⟨g(x), y − x⟩ + µ·D_h(y, x). (15)

In [8] it was shown that the notion of relative strong convexity embodied in Definition 3.2 is the natural way to define strong convexity in the context of mirror descent and similar algorithms, and it leads to linear convergence of mirror descent in the smooth setting. In the non-smooth setting, we will show in Theorem 4.2 that relative strong convexity improves the convergence of mirror descent from O(1/√k) to O(1/k).

3.2 Stochastic Setting

For x ∈ Q, let g̃(x) denote a random stochastic estimate of a subgradient of f at x. Extending the definition of relative continuity from the deterministic setting, we define stochastic relative continuity as follows.

Definition 3.3. f is G-stochastically-relative continuous with respect to the reference function h on Q for some G > 0 if f together with the oracle to compute a stochastic subgradient satisfies:

1. Unbiasedness property: E[g̃(x)|x] ∈ ∂f(x), and

2. Boundedness property: E[‖g̃(x)‖_*²|x] ≤ G²·( 2·D_h(y, x)/‖y − x‖² ) for all x, y ∈ Q with x ≠ y.
(In the particular case when h(x) = ½‖x‖₂², the Bregman distance is D_h(y, x) = ½‖y − x‖₂², whereby the boundedness property of stochastic relative continuity becomes E[‖g̃(x)‖₂²|x] ≤ G² for all x ∈ Q, which corresponds to the standard condition (9) for the ℓ2-norm.)
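As an illustration of the unbiasedness property in the empirical-loss setting of Section 2.2, the sketch below (our own example; the data array a and the losses f_j(x) = |x − a_j| are hypothetical) draws a single index uniformly and returns a subgradient of f_j; averaging many such draws approaches the full subgradient (1/n)·Σ_j sign(x − a_j):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([-1.0, 0.5, 2.0, 3.0])   # hypothetical sample data

def stochastic_subgrad(x):
    """Single-sample stochastic subgradient for f(x) = (1/n) sum_j |x - a_j|:
    draw j uniformly from {1, ..., n} and return sign(x - a_j) in df_j(x)."""
    j = rng.integers(len(a))
    return np.sign(x - a[j])

# Empirical check of unbiasedness, E[g~(x) | x] in df(x): the sample mean of
# the stochastic subgradients should approach (1/n) * sum_j sign(x - a_j).
x = 1.0
full_subgrad = np.mean(np.sign(x - a))
estimate = np.mean([stochastic_subgrad(x) for _ in range(20000)])
```

The sample mean concentrates around the full subgradient at the usual O(1/√N) Monte Carlo rate.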
For x ∈ Q, define

M(x) := ‖g̃(x)‖_* · max_{y∈Q, y≠x} ‖y − x‖/√(2·D_h(y, x)). (16)

Notice for a given x that max_{y∈Q, y≠x} ‖y − x‖/√(2·D_h(y, x)) is a deterministic quantity, and therefore M(x) is a conditional random variable (given x) that is defined on the same probability space as g̃(x). Clearly, if f is G-stochastically-relative continuous, we have by the boundedness property that for any x ∈ Q:

E[M(x)²|x] = E[‖g̃(x)‖_*²|x] · max_{y∈Q, y≠x} ‖y − x‖²/(2·D_h(y, x)) ≤ G². (17)

Exactly as in the deterministic setting, we have:

Proposition 3.3. If f is G-stochastically-relative continuous with respect to the reference function h on Q, then for any t > 0 it holds for all x, y ∈ Q and any stochastic subgradient estimate g̃(x) that:

(1/t)·D_h(y, x) + ⟨g̃(x), y − x⟩ + (t/2)·M(x)² ≥ 0.

Proof: For any t > 0, we have

⟨g̃(x), x − y⟩ ≤ ‖g̃(x)‖_*·‖y − x‖ ≤ M(x)·√(2·D_h(y, x)) ≤ (t/2)·M(x)² + (1/t)·D_h(y, x),

and the proof is finished by rearranging terms.

4 Computational Analysis for Stochastic Mirror Descent and (Deterministic) Mirror Descent

In this section we present computational guarantees for Stochastic Mirror Descent (Algorithm 2) for minimizing a convex function f that is G-stochastically-relative continuous with respect to a given reference function h. We also present computational guarantees for (deterministic) Mirror Descent (Algorithm 1) when f is M-relative continuous with respect to a reference function h, which follows as a special case of the stochastic setting. We begin by recalling the standard Three-Point Property for optimization using Bregman distances:

Lemma 4.1. (Three-Point Property (Tseng [4])) Let φ(x) be a convex function, and let D_h(·,·) be the Bregman distance for h. For a given vector z, let

z⁺ := arg min_{x∈Q} { φ(x) + D_h(x, z) }.

Then φ(x) + D_h(x, z) ≥ φ(z⁺) + D_h(z⁺, z) + D_h(x, z⁺) for all x ∈ Q.
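The Three-Point Property can be verified numerically in a simple setting (our own check, not from the paper): take h(x) = ½‖x‖₂², Q = R^n, and a linear φ(x) = ⟨c, x⟩, for which z⁺ = z − c in closed form; for this quadratic h the inequality of Lemma 4.1 in fact holds with equality:

```python
import numpy as np

rng = np.random.default_rng(1)

def D(y, x):
    """Bregman distance of h(x) = (1/2)||x||^2: D_h(y, x) = (1/2)||y - x||^2."""
    return 0.5 * float(np.dot(y - x, y - x))

c = rng.standard_normal(5)
z = rng.standard_normal(5)
z_plus = z - c            # argmin_x { <c, x> + D(x, z) } in closed form

x = rng.standard_normal(5)
lhs = float(np.dot(c, x)) + D(x, z)
rhs = float(np.dot(c, z_plus)) + D(z_plus, z) + D(x, z_plus)
gap = lhs - rhs           # >= 0 by the Three-Point Property
```

For a general convex φ, or a non-quadratic h, the gap is nonnegative but typically strictly positive; equality here is special to the quadratic h with linear φ.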
Let us denote the (primitive) random variable at the i-th iteration of the Stochastic Mirror Descent Algorithm (Algorithm 2) by γ_i; i.e., γ_i is the random variable that determines the stochastic subgradient g̃(x^i) at the iterate x^i in the Stochastic Mirror Descent Algorithm. Then x^{i+1} is computed according to the update of the Stochastic Mirror Descent Algorithm, whereby x^{i+1} is a random variable that depends on all previous values γ_0, ..., γ_i; we denote this string of random variables by ξ_i := {γ_0, ..., γ_i}. The following theorem states convergence guarantees for the Stochastic Mirror Descent Algorithm in terms of expectation.

Theorem 4.1. (Convergence Bound for Stochastic Mirror Descent Algorithm) Consider the Stochastic Mirror Descent Algorithm (Algorithm 2) with given step-size sequence {t_i}. If f is G-stochastically-relative continuous with respect to h for some G > 0, then the following inequality holds for all k and x ∈ Q:

E_{ξ_k}[ f(x̄^k) ] − f(x) ≤ ( (G²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ) / ( Σ_{i=0}^k t_i ), (18)

where x̄^k := (1/(Σ_{i=0}^k t_i))·Σ_{i=0}^k t_i·x^i.

Proof: First notice that

f(x^i) + ⟨g(x^i), x − x^i⟩
= f(x^i) + ⟨E_{γ_i}[g̃(x^i)|x^i], x − x^i⟩
= f(x^i) + E_{γ_i}[ ⟨g̃(x^i), x − x^i⟩ | x^i ]
≥ f(x^i) + E_{γ_i}[ ⟨g̃(x^i), x^{i+1} − x^i⟩ + (1/t_i)·D_h(x^{i+1}, x^i) + (1/t_i)·D_h(x, x^{i+1}) − (1/t_i)·D_h(x, x^i) | x^i ]
≥ f(x^i) + E_{γ_i}[ −(t_i/2)·M(x^i)² + (1/t_i)·D_h(x, x^{i+1}) − (1/t_i)·D_h(x, x^i) | x^i ]
≥ f(x^i) − (t_i/2)·G² + (1/t_i)·E_{γ_i}[D_h(x, x^{i+1})|x^i] − (1/t_i)·D_h(x, x^i), (19)

where the first equality uses the unbiasedness of g̃(x^i) (with g(x^i) := E_{γ_i}[g̃(x^i)|x^i] ∈ ∂f(x^i)), the second equality is because of linearity, the first inequality is from the Three-Point Property with φ(x) = t_i·⟨g̃(x^i), x − x^i⟩, the second inequality uses Proposition 3.3 (with y = x^{i+1}), and the last inequality uses (17). Since also f(x) ≥ f(x^i) + ⟨g(x^i), x − x^i⟩ from the definition of a subgradient, we have from (19):

f(x) ≥ f(x^i) − (t_i/2)·G² + (1/t_i)·E_{γ_i}[D_h(x, x^{i+1})|x^i] − (1/t_i)·D_h(x, x^i).
Taking expectations with respect to ξ_i on both sides of the above inequality yields:

f(x) ≥ E_{ξ_i}[f(x^i)] − (t_i/2)·G² + (1/t_i)·E_{ξ_i}[D_h(x, x^{i+1})] − (1/t_i)·E_{ξ_i}[D_h(x, x^i)]. (20)

Now rearrange and multiply through by t_i to yield:

t_i·E_{ξ_i}[f(x^i) − f(x)] ≤ (t_i²/2)·G² + E_{ξ_i}[D_h(x, x^i)] − E_{ξ_i}[D_h(x, x^{i+1})]. (21)
Summing the above inequality over i = 0, ..., k and noting that D_h(x, x^{k+1}) ≥ 0, we arrive at:

(G²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ≥ Σ_{i=0}^k t_i·E_{ξ_i}[f(x^i) − f(x)] = E_{ξ_k}[ Σ_{i=0}^k t_i·(f(x^i) − f(x)) ] ≥ ( Σ_{i=0}^k t_i )·E_{ξ_k}[ f(x̄^k) − f(x) ],

where the last inequality uses the convexity of f. Dividing by Σ_{i=0}^k t_i completes the proof.

Remark 4.1. As a direct consequence of the proof of Theorem 4.1, we obtain the following result, which is similar to the deterministic setting (5):

E_{ξ_k}[ min_{0≤i≤k} f(x^i) ] − f(x) ≤ ( (G²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ) / ( Σ_{i=0}^k t_i ).

Theorem 4.1 implies the following high-probability result using a simple Markov bound.

Corollary 4.1. Let x* be an optimal solution of (1). Under the conditions of Theorem 4.1, for any δ > 0 it holds that:

P[ f(x̄^k) − f* ≥ δ ] ≤ ( (G²/2)·Σ_{i=0}^k t_i² + D_h(x*, x⁰) ) / ( δ·Σ_{i=0}^k t_i ).

Proof: Using the Markov inequality, we have:

P[ f(x̄^k) − f* ≥ δ ] ≤ E[ f(x̄^k) − f* ]/δ ≤ ( (G²/2)·Σ_{i=0}^k t_i² + D_h(x*, x⁰) ) / ( δ·Σ_{i=0}^k t_i ).

Similar to the case of the traditional analysis of stochastic mirror descent, the Stochastic Mirror Descent Algorithm (Algorithm 2) leads to an O(1/ε²) convergence guarantee (in expectation) by using an appropriate step-size sequence {t_i}, as the next corollary shows.

Corollary 4.2. Under the conditions of Theorem 4.1, for a given ε > 0 suppose that the step-sizes are set to t_i := ε/G² for all i. Then within

k := ⌈ 2·G²·D_h(x*, x⁰) / ε² ⌉

iterations of the Stochastic Mirror Descent Algorithm it holds that:

E[ f(x̄^k) ] − f* ≤ ε,

where x* is any optimal solution of (1).

Proof: Substituting the values of t_i in (18) yields the result directly.
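The constant step-size rule of Corollary 4.2 is easy to compute; the helper below is our own sketch, assuming the constants t_i = ε/G² and k = ⌈2·G²·D_h(x*, x⁰)/ε²⌉ (the values of G, D, and ε are illustrative only, not from the paper):

```python
import math

def smd_schedule(G, D, eps):
    """Constant step-size and iteration count for an eps-optimality gap
    (in expectation), where D stands for D_h(x*, x^0)."""
    t = eps / G**2                          # constant step-size t_i
    k = math.ceil(2 * G**2 * D / eps**2)    # iterations for E[gap] <= eps
    return t, k

t, k = smd_schedule(G=2.0, D=1.0, eps=0.1)
```

Note the ε⁻² scaling of the iteration count, matching the O(1/ε²) guarantee.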
Remark 4.2. Similar to the standard stochastic gradient descent scheme, the step-size rule in Corollary 4.2 leads to the optimal rate of convergence provided by the bound in Theorem 4.1.

Remark 4.3. Let us now compare these results to the related results of Nedić and Lee [9]. In order to attain an ε-optimality gap, [9] proved a bound of 9·G_f²·D_max/(4·µ_h·ε²) iterations, which follows by rearranging (10). In addition to not requiring Lipschitz continuity of f, our bound does not require that h be strongly convex. We also do not require boundedness of the feasible region; and in most settings D_h(x*, x⁰) ≪ D_max even when D_max < +∞. Furthermore, even in the setting of (10), it holds that:

G² = E[‖g̃(x)‖²|x] · max_{y∈Q, y≠x} ‖y − x‖²/(2·D_h(y, x)) ≤ E[‖g̃(x)‖²|x]/µ_h ≤ G_f²/µ_h,

where the first inequality utilizes the strong convexity (in the standard sense) of h, and the second inequality is due to the assumption that f is G_f-stochastically continuous (in the standard sense). Thus we see that the bound in Corollary 4.2 improves on the bound in [9].

In the case when f is also µ-strongly convex relative to h (see Definition 3.2), we obtain an O(1/k) convergence guarantee in expectation, which is also similar to the traditional case of stochastic gradient descent. This is shown in the next result.

Theorem 4.2. (Convergence Bound for Stochastic Mirror Descent Algorithm under Strong Convexity relative to h) Consider the Stochastic Mirror Descent Algorithm (Algorithm 2). If f is G-stochastically-relative continuous with respect to h for some G > 0, f is µ-strongly convex relative to h for some µ > 0, and the step-sizes are chosen as t_i = 2/(µ·(i+1)), then the following inequality holds for all k ≥ 1:

E_{ξ_k}[ f(x̂^k) ] − f* ≤ 2·G²/(µ·(k+1)),

where x̂^k := (2/(k·(k+1)))·Σ_{i=0}^k i·x^i.

Proof: For any x ∈ Q it follows from the definition of µ-strong convexity relative to h (15) that

f(x) ≥ f(x^i) + ⟨g(x^i), x − x^i⟩ + µ·D_h(x, x^i).
Combining the above inequality with (19) (the chain of inequalities in the proof of Theorem 4.1) yields

f(x) ≥ f(x^i) − (t_i/2)·G² + (1/t_i)·E_{γ_i}[D_h(x, x^{i+1})|x^i] − ((1/t_i) − µ)·D_h(x, x^i).

Substituting t_i = 2/(µ·(i+1)) and multiplying by i in the above inequality yields:

i·( f(x^i) − f(x) ) ≤ G²·i/(µ·(i+1)) + (µ/2)·( i(i−1)·D_h(x, x^i) − i(i+1)·E_{γ_i}[D_h(x, x^{i+1})|x^i] )
≤ G²/µ + (µ/2)·( i(i−1)·D_h(x, x^i) − i(i+1)·E_{γ_i}[D_h(x, x^{i+1})|x^i] ).

Taking expectations over ξ_i and summing the above inequality over i = 0, ..., k then yields

( Σ_{i=0}^k i )·E_{ξ_k}[ f(x̂^k) − f(x) ] ≤ Σ_{i=0}^k i·( E_{ξ_i}[f(x^i)] − f(x) ) ≤ k·G²/µ − (µ/2)·k(k+1)·E_{ξ_k}[D_h(x, x^{k+1})] ≤ k·G²/µ,
where the first inequality uses the convexity of f and the observation that E_{ξ_i}[f(x^i)] = E_{ξ_k}[f(x^i)] for i ≤ k. Taking x = x*, where x* is an optimal solution of (1), the proof is completed by noticing that Σ_{i=0}^k i = k(k+1)/2.

Remark 4.4. It may not be easy to find cases when the objective function is both G-stochastically-relative continuous and µ-strongly convex relative to the same reference function h. However, as long as these properties are satisfied along the path of iterates or around the minimum, one can achieve the faster convergence of Theorem 4.2.

Remark 4.5. Let us also compare the computational guarantee of Theorem 4.2 to the results in Nedić and Lee [9]. In order to attain an ε-optimality gap, [9] proved the bound (11). First notice that we require neither that f is uniformly Lipschitz continuous, nor that h is strongly convex in the traditional sense, nor that h is uniformly smooth. However, even if these requirements hold, it follows from Remark 4.3 that G² ≤ G_f²/µ_h, and it also holds that:

D_f(x, y) ≥ (µ_f/2)·‖x − y‖² ≥ (µ_f/L_h)·D_h(x, y),

where the first inequality utilizes that f is µ_f-strongly convex and the second that h is L_h-smooth, both in the standard sense. Thus f is at least µ = (µ_f/L_h)-strongly convex relative to h (this follows from [8]). Therefore, even under the stronger requirements of [9], Theorem 4.2 improves on the corresponding result in [9].

We end this section with a discussion of the deterministic setting, namely the (Deterministic) Mirror Descent Algorithm (Algorithm 1). Suppose that there is no stochasticity in the computation of subgradients. We can cast this as an instance of the Stochastic Mirror Descent Algorithm (Algorithm 2) wherein g̃(x) = g(x) ∈ ∂f(x) for all x ∈ Q. In this case relative stochastic continuity (Definition 3.3) is equivalent to relative continuity (Definition 3.1) with the same constant.
Thus deterministic Mirror Descent is a special case of Stochastic Mirror Descent, and we have the following computational guarantees as special cases of the stochastic case.

Theorem 4.3. (Convergence Bound for Deterministic Mirror Descent Algorithm) Consider the (Deterministic) Mirror Descent Algorithm (Algorithm 1). If f is M-relative continuous with respect to h for some M > 0, then for all k and x ∈ Q the following inequality holds:

f(x̄^k) − f(x) ≤ ( (M²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ) / ( Σ_{i=0}^k t_i ),

where x̄^k := (1/(Σ_{i=0}^k t_i))·Σ_{i=0}^k t_i·x^i.

Corollary 4.3. Under the conditions of Theorem 4.3, for a given ε > 0 suppose that the step-sizes are set to t_i := ε/M² for all i. Then within

k := ⌈ 2·M²·D_h(x*, x⁰) / ε² ⌉
iterations of Deterministic Mirror Descent it holds that:

f(x̄^k) − f* ≤ ε,

where x̄^k := (1/(Σ_{i=0}^k t_i))·Σ_{i=0}^k t_i·x^i, and x* is any optimal solution of (1).

Theorem 4.4. (Convergence Bounds for Deterministic Mirror Descent with Strong Relative Convexity) Consider the Deterministic Mirror Descent Algorithm (Algorithm 1). If f is M-relative continuous with respect to h for some M > 0, f is µ-strongly convex relative to h for some µ > 0, and the step-sizes are chosen as t_i = 2/(µ·(i+1)), then the following inequality holds for all k ≥ 1:

f(x̂^k) − f* ≤ 2·M²/(µ·(k+1)),

where x̂^k := (2/(k·(k+1)))·Σ_{i=0}^k i·x^i.

5 Specifying a Reference Function h with Relative Continuity for Mirror Descent

Let us discuss using either deterministic or stochastic Mirror Descent (Algorithm 1 or Algorithm 2) for solving the optimization problem (1) with an objective function f that is M-relative continuous or G-stochastically-relative continuous (respectively) with respect to the reference function h. In order to efficiently execute the update step in Algorithm 1 and/or Algorithm 2, we need h to be such that the linearization subproblem LS (7) is efficiently solvable for any given c. Therefore, in order to execute Mirror Descent for solving (1) using Algorithm 1 or Algorithm 2, we need to specify a differentiable convex reference function h that has the following two properties: (i) f is M-relative continuous (or G-stochastically-relative continuous) with respect to h on Q, for an M (or G) that is easy to determine, and (ii) the linearization subproblem LS (7) has a solution, and the solution is efficiently computable. We now discuss quite broadly how to construct such a reference function h with these two properties when ‖g(x)‖₂ is bounded by a polynomial in ‖x‖₂.

5.1 Deterministic Setting

Suppose that ‖g(x)‖₂ ≤ p_r(‖x‖₂) for all x ∈ Q and all g(x) ∈ ∂f(x), where p_r(α) = Σ_{i=0}^r a_i·α^i is an r-degree polynomial in α whose coefficients {a_i} are nonnegative. Since the coefficients {a_i} are nonnegative, the squared polynomial p_r(α)² also has nonnegative coefficients, which we write as p_r(α)² = Σ_{i=0}^{2r} â_i·α^i. Let

h(x) := Σ_{i=0}^{2r} ( â_i/(i+2) )·‖x‖₂^{i+2}.
Then the following proposition states that f is 1-relatively continuous with respect to h. This implies that no matter how fast the subgradients of f grow polynomially as $\|x\| \to \infty$, f is relatively continuous with respect to the simple reference function h, even though f does not exhibit uniform Lipschitz continuity.

Proposition 5.1. f is 1-continuous relative to $h(x) = \sum_{i=0}^{r} \frac{a_i}{i+2}\|x\|^{i+2}$.

Proof: Let $h_i(x) = \frac{1}{i+2}\|x\|^{i+2}$, so that $h(x) = \sum_{i=0}^{r} a_i h_i(x)$, and by the definition of Bregman distance we have

$$ D_{h_i}(y,x) = \tfrac{1}{i+2}\|y\|^{i+2} - \tfrac{1}{i+2}\|x\|^{i+2} - \|x\|^i \langle x, y-x\rangle = \tfrac{1}{i+2}\left( \|y\|^{i+2} + (i+1)\|x\|^{i+2} - (i+2)\|x\|^i \langle x, y\rangle \right) . $$

Notice that

$$ \begin{aligned} \|y\|^{i+2} + (i+1)\|x\|^{i+2} - (i+2)\|x\|^i\langle x,y\rangle &= \left( \|y\|^{i+2} + \tfrac{i}{2}\|x\|^{i+2} - \tfrac{i+2}{2}\|x\|^i\|y\|^2 \right) + \tfrac{i+2}{2}\left( \|x\|^{i+2} + \|x\|^i\|y\|^2 - 2\|x\|^i\langle x,y\rangle \right) \\ &\ge 0 + \tfrac{i+2}{2}\,\|x\|^i\left( \|x\|^2 + \|y\|^2 - 2\langle x,y\rangle \right) = \tfrac{i+2}{2}\,\|x\|^i\,\|y-x\|^2 \ , \end{aligned} $$

where the inequality above is an application of the arithmetic-geometric mean inequality $a^{\lambda} b^{1-\lambda} \le \lambda a + (1-\lambda)b$ with $a = \|x\|^{i+2}$, $b = \|y\|^{i+2}$, and $\lambda = \frac{i}{i+2}$. Thus we have

$$ D_h(y,x) = \sum_{i=0}^{r} a_i D_{h_i}(y,x) \ \ge \ \frac{1}{2}\|y-x\|^2 \sum_{i=0}^{r} a_i\|x\|^i \ . \qquad (22) $$

Therefore

$$ \|g(x)\|^2 \ \le \ p_r(\|x\|) = \sum_{i=0}^{r} a_i\|x\|^i \ \le \ \frac{2\, D_h(y,x)}{\|y-x\|^2} \ , $$

which shows that f is 1-relatively continuous with respect to h.

Solving the linearization subproblem (7). Let us see how we can solve the linearization subproblem (7) for this class of optimization problems. The linearization subproblem (7) can be written as

$$ \text{LS:} \quad \min_{x \in \mathbb{R}^n} \ \langle c, x\rangle + \sum_{i=0}^{r} \frac{a_i}{i+2}\|x\|^{i+2} \ , \qquad (23) $$

and the first-order optimality condition is simply:

$$ c + \left( \sum_{i=0}^{r} a_i \|x\|^i \right) x = 0 \ , \qquad (24) $$
whereby $x = -\theta c$ for some scalar $\theta \ge 0$, and it remains to simply determine the value of the nonnegative scalar $\theta$. In the case when $c = 0$, $x = 0$ satisfies (24), so let us examine the case when $c \ne 0$, in which case from (24) $\theta$ must satisfy:

$$ \sum_{i=0}^{r} a_i \|c\|^i\, \theta^{i+1} - 1 = 0 \ , $$

which implies that $\theta$ is the unique positive root of a univariate polynomial that is monotone increasing in $\theta \ge 0$. For $r \in \{0, 1, 2, 3\}$ this root can be computed in closed form. Otherwise the root can be computed efficiently (up to machine precision) using any suitable root-finding method.

Remark 5.1. We can incorporate a simple set constraint $x \in Q$ in problem (23) provided that we can easily compute the Euclidean projection onto Q. In this case, the linearization subproblem (7) can be converted to a 1-dimensional convex optimization problem; see Appendix A.1 of [8] for details.

5.2 Stochastic Setting

In the stochastic setting, the stochastic subgradient g(x) is a conditional random variable for a given x. Suppose that $E\left[\|g(x)\|^2 \mid x\right] \le p_r(\|x\|)$ for all $x \in Q$, where $p_r(\alpha) = \sum_{i=0}^{r} a_i \alpha^i$ is an r-degree polynomial whose coefficients $\{a_i\}$ are nonnegative. Let

$$ h(x) := \sum_{i=0}^{r} \frac{a_i}{i+2}\|x\|^{i+2} \ , $$

and similar to the deterministic case we have:

Proposition 5.2. f is 1-stochastically continuous relative to $h(x) = \sum_{i=0}^{r} \frac{a_i}{i+2}\|x\|^{i+2}$.

Proof: For any $x, y \in Q$ with $x \ne y$ we have:

$$ E\left[\|g(x)\|^2 \mid x\right] \ \le \ p_r(\|x\|) = \sum_{i=0}^{r} a_i\|x\|^i \ \le \ \frac{2\, D_h(y,x)}{\|y-x\|^2} \ , \qquad (25) $$

where the last inequality follows from (22), and thus f is 1-stochastically continuous relative to h.

Solving the linearization subproblem (7). The linearization subproblem is identical in structure to that in the deterministic case, and so it can be solved as discussed at the end of Section 5.1.

5.3 Relative Continuity for instances of SVM and IEP

Here we examine in detail the two motivating examples stated in the Introduction, namely the Support Vector Machine (SVM) problem, and the Intersection of Ellipsoids Problem (IEP).
We first prove the following lemma, which presents upper bounds on the Bregman distances $D_h(\cdot,\cdot)$ for $h(x) = \frac{1}{3}\|x\|^3$ and $h(x) = \frac{1}{4}\|x\|^4$.
Lemma 5.1.

1. Let $h(x) := \frac{1}{3}\|x\|^3$. Then $D_h(y,x) \le \frac{1}{3}\|y-x\|^2\left(\|y\| + 2\|x\|\right)$.

2. Let $h(x) := \frac{1}{4}\|x\|^4$. Then $D_h(y,x) \le \frac{1}{4}\|y-x\|^2\left( (\|y\| + \|x\|)^2 + 2\|x\|^2 \right)$.

Proof:

1. We have

$$ \begin{aligned} D_h(y,x) &= \tfrac{1}{3}\left( \|y\|^3 + 2\|x\|^3 - 3\|x\|\langle x, y\rangle \right) \\ &\le \tfrac{1}{3}\left( \|y\|^3 + 2\|x\|^3 - 3\|x\|\langle x,y\rangle + 2\|y\|\left(\|x\|\|y\| - \langle x,y\rangle\right) + \|x\|\left(\|x\|\|y\| - \langle x,y\rangle\right) \right) \\ &= \tfrac{1}{3}\|y-x\|^2\left(\|y\| + 2\|x\|\right) \ , \end{aligned} $$

where the first equality follows from simplifying and combining terms, the inequality follows from applying the Cauchy-Schwarz inequality twice, and the final equality is from simplifying and combining terms.

2. We have

$$ \begin{aligned} D_h(y,x) &= \tfrac{1}{4}\left( \|y\|^4 + 3\|x\|^4 - 4\|x\|^2\langle x,y\rangle \right) \\ &\le \tfrac{1}{4}\left( \|y\|^4 + 3\|x\|^4 - 4\|x\|^2\langle x,y\rangle + 4\|x\|^2\|y\|^2 - 4\langle x,y\rangle^2 \right) \\ &= \tfrac{1}{4}\left( \|x\|^2 + \|y\|^2 - 2\langle x,y\rangle \right)\left( 3\|x\|^2 + \|y\|^2 + 2\langle x,y\rangle \right) \ \le \ \tfrac{1}{4}\|y-x\|^2\left( (\|y\| + \|x\|)^2 + 2\|x\|^2 \right) \ , \end{aligned} $$

where both inequalities follow from applying the Cauchy-Schwarz inequality, and the equalities follow from simplifying and combining terms.

Support Vector Machine (SVM). The Support Vector Machine is an important supervised learning model for binary classification in machine learning. The SVM optimization problem for binary classification is:

$$ \text{SVM:} \quad \min_{x} \ f(x) := \frac{1}{n}\sum_{i=1}^{n} \max\{0,\, 1 - y_i x^T w_i\} + \frac{\lambda}{2}\|x\|^2 \ , \qquad (26) $$

where $w_i$ is the input feature vector of sample i and $y_i \in \{1, -1\}$ is the label of sample i. Notice that f is not differentiable due to the presence of the hinge loss terms in the summation, and f is also not Lipschitz continuous due to the presence of the squared $\ell_2$-norm regularization term; thus we cannot directly utilize typical subgradient or gradient schemes and their associated computational guarantees in the analysis of (26). Researchers have developed various approaches to overcome this limitation. For example, [5] introduced a splitting subgradient-type method, where the basic idea is to split the loss function and the regularization terms. [16] introduced a quasi-Newton method that does not need to contend with the unbounded subgradients.
Another approach is to a priori constrain x to lie in an $\ell_2$-ball of radius R, for R sufficiently large that the ball contains the optimal solution, and to project onto this ball at each iteration; in this approach f is Lipschitz continuous on the amended feasible region, see [13]. Indeed, one can show using quadratic
optimization optimality conditions that it suffices to set $R = \min\left\{ \frac{1}{n\lambda}\sum_{i=1}^{n}\|w_i\|, \ \sqrt{2/\lambda} \right\}$ (see the Appendix), wherein the modulus of Lipschitz continuity on the amended feasible region is at most $M \le \frac{1}{n}\sum_{i=1}^{n}\|w_i\| + \min\left\{ \sqrt{2\lambda}, \ \frac{1}{n}\sum_{i=1}^{n}\|w_i\| \right\}$. Furthermore, in [7] the authors show that if the initial point lies within a suitably chosen large ball, then Stochastic Subgradient Descent with a small step-size ensures in expectation that all iterates lie in the large ball, which then ensures that the norms of all subgradients are bounded in expectation.

Let us see how we can directly use the constructs of relative continuity to tackle the SVM problem with a suitably designed version of Stochastic Mirror Descent, without any projection step onto a ball. We can rewrite the objective function of (26) as

$$ f(x) = \frac{1}{n}\sum_{j=1}^{n} f_j(x) \ , \quad \text{where } f_j(x) = \max\{0,\, 1 - y_j x^T w_j\} + \frac{\lambda}{2}\|x\|^2 \ . $$

We consider computing a stochastic estimate of the subgradient of f by using a single sample, namely $g(x) \in \partial f_j(x)$ where the index j is drawn uniformly at random from $\{1, \ldots, n\}$. Then $\|g(x)\| \le \lambda\|x\| + \|w_j\|$, whereby

$$ E\left[\|g(x)\|^2 \mid x\right] \ \le \ \lambda^2\|x\|^2 + 2\lambda\left( \frac{1}{n}\sum_{i=1}^{n}\|w_i\| \right)\|x\| + \frac{1}{n}\sum_{i=1}^{n}\|w_i\|^2 \ , $$

and notice that the right-hand side is a polynomial in $\|x\|$ of degree $r = 2$. If we choose the reference function h as

$$ h(x) := \frac{\lambda^2}{4}\|x\|^4 + \frac{2\lambda}{3n}\left( \sum_{i=1}^{n}\|w_i\| \right)\|x\|^3 + \frac{1}{2n}\left( \sum_{i=1}^{n}\|w_i\|^2 \right)\|x\|^2 \ , \qquad (27) $$

it follows from Proposition 5.2 that f is 1-stochastically continuous relative to h.

Proposition 5.3. (Computational Guarantees for Stochastic Mirror Descent for the SVM problem (26)) Consider applying the Stochastic Mirror Descent algorithm (Algorithm 2) to the Support Vector Machine problem (26) using the reference function (27). For an absolute optimality tolerance $\varepsilon > 0$, and using the constant step-sizes $t_i := \varepsilon$, let the algorithm be run for

$$ k := \left\lceil \frac{\|x^* - x^0\|^2 \left( 3\lambda^2\left( (\|x^*\| + \|x^0\|)^2 + 2\|x^0\|^2 \right) + \frac{8\lambda}{n}\left( \sum_{i=1}^{n}\|w_i\| \right)\left( \|x^*\| + 2\|x^0\| \right) + \frac{6}{n}\sum_{i=1}^{n}\|w_i\|^2 \right)}{6\,\varepsilon^2} \right\rceil $$

iterations, where $x^*$ is the optimal solution of (26). Then it holds that

$$ E\left[ f(\bar{x}^k) - f^* \right] \le \varepsilon \ , \quad \text{where } \bar{x}^k := \frac{1}{k+1}\sum_{i=0}^{k} x^i \ . $$

Proof: We showed above (using Proposition 5.2)
that f is 1-stochastically continuous relative to h defined in (27). Furthermore, applying Lemma 5.1 it follows that

$$ D_h(x^*, x^0) \ \le \ \frac{\lambda^2}{4}\|x^* - x^0\|^2\left( (\|x^*\| + \|x^0\|)^2 + 2\|x^0\|^2 \right) + \frac{2\lambda}{3n}\left( \sum_{i=1}^{n}\|w_i\| \right)\|x^* - x^0\|^2\left( \|x^*\| + 2\|x^0\| \right) + \frac{1}{2n}\left( \sum_{i=1}^{n}\|w_i\|^2 \right)\|x^* - x^0\|^2 \ . $$
The proof is finished by substituting these values into the computational guarantee of Corollary 4.2.

Intersection of Ellipsoids Problem (IEP). Consider the problem of computing a point $x \in \mathbb{R}^m$ in the intersection of n ellipsoids, namely:

$$ x \in Q := Q_1 \cap Q_2 \cap \cdots \cap Q_n \ , \qquad (28) $$

where $Q_i = \left\{ x \in \mathbb{R}^m : \tfrac{1}{2}x^T A_i x + b_i^T x + c_i \le 0 \right\}$ and $A_i \in \mathbb{R}^{m \times m}$ is a given symmetric positive semi-definite matrix, $i = 1, \ldots, n$. This problem can be cast as a second-order cone optimization problem, and hence can be tackled using interior-point methods. However, interior-point methods are typically only effective when the dimensions m and/or n are of moderate size. Another way to tackle the problem is to use a first-order method to solve the unconstrained problem

$$ \text{IEP:} \quad f^* = \min_{x} \ f(x) := \max_{1 \le i \le n}\left\{ \tfrac{1}{2}x^T A_i x + b_i^T x + c_i \right\} \ , \qquad (29) $$

and notice that $f(x) \le 0$ if and only if $x \in Q$, and $Q \ne \emptyset$ if and only if $f^* \le 0$. However, the objective function f in (29) is both non-differentiable and non-Lipschitz, and so it falls outside the scope of optimization problems for which traditional first-order methods are applicable.

Let us see how we can use the machinery of relative continuity to tackle this problem. Let $\sigma := \max_{1 \le i \le n} \|A_i\|$, where $\|A_i\|$ is the spectral radius of $A_i$; let $\rho := 2\max_{1 \le i \le n} \|A_i b_i\|$; and let $\gamma := \max_{1 \le i \le n} \|b_i\|$. Notice that for any x and $i = 1, \ldots, n$ we have $g_i(x) := A_i x + b_i = \nabla f_i(x)$, where $f_i(x)$ is the i-th term in the objective function of (29). Since $g(x) \in \partial f(x)$ if and only if g(x) is a convex combination of the active gradients $\nabla f_i(x)$ (see Danskin's Theorem in Appendix B of [3]), it follows for any $g(x) \in \partial f(x)$ that

$$ \|g(x)\|^2 \ \le \ \max_{1 \le i \le n} \|A_i x + b_i\|^2 \ = \ \max_{1 \le i \le n}\left( x^T A_i^2 x + 2\, b_i^T A_i x + \|b_i\|^2 \right) \ \le \ \sigma^2\|x\|^2 + \rho\|x\| + \gamma^2 \ . $$

Therefore we have $\|g(x)\|^2 \le p_2(\|x\|)$, where $p_2(\alpha) = \sigma^2\alpha^2 + \rho\alpha + \gamma^2$ is a quadratic function of $\alpha$, i.e., a polynomial in $\alpha$ of degree $r = 2$. It follows from Proposition 5.1 that f is 1-continuous relative to the reference function

$$ h(x) := \frac{\sigma^2}{4}\|x\|^4 + \frac{\rho}{3}\|x\|^3 + \frac{\gamma^2}{2}\|x\|^2 \ . \qquad (30) $$

Proposition 5.4.
(Computational Guarantees for Deterministic Mirror Descent for the IEP problem (29)) Consider applying the Deterministic Mirror Descent algorithm (Algorithm 1) to the Intersection of Ellipsoids Problem (29) using the reference function (30), where $\sigma := \max_{1 \le i \le n}\|A_i\|$ and $\|A_i\|$ is the spectral radius of $A_i$, $\rho := 2\max_{1 \le i \le n}\|A_i b_i\|$, and $\gamma := \max_{1 \le i \le n}\|b_i\|$. For an absolute optimality tolerance $\varepsilon > 0$, and using the constant step-sizes $t_i := \varepsilon$, let the algorithm be run for

$$ k := \left\lceil \frac{\|x^* - x^0\|^2\left( 3\sigma^2\left( (\|x^*\| + \|x^0\|)^2 + 2\|x^0\|^2 \right) + 4\rho\left( \|x^*\| + 2\|x^0\| \right) + 6\gamma^2 \right)}{6\,\varepsilon^2} \right\rceil $$

iterations, where $x^*$ is any optimal solution of (29). Then it holds that $f(\bar{x}^k) - f^* \le \varepsilon$, where $\bar{x}^k := \frac{1}{k+1}\sum_{i=0}^{k} x^i$. (This problem was suggested by Nesterov [12].)
Proof: We showed above (using Proposition 5.1) that f is 1-continuous relative to $h(x) = \frac{\sigma^2}{4}\|x\|^4 + \frac{\rho}{3}\|x\|^3 + \frac{\gamma^2}{2}\|x\|^2$. Furthermore, applying Lemma 5.1 it follows that

$$ D_h(x^*, x^0) \ \le \ \frac{\sigma^2}{4}\|x^* - x^0\|^2\left( (\|x^*\| + \|x^0\|)^2 + 2\|x^0\|^2 \right) + \frac{\rho}{3}\|x^* - x^0\|^2\left( \|x^*\| + 2\|x^0\| \right) + \frac{\gamma^2}{2}\|x^* - x^0\|^2 \ . $$

The proof is finished by substituting these values into the computational guarantee of Corollary 4.3.

Acknowledgement

The author would like to express his gratitude to Robert M. Freund for thoughtful discussions that helped motivate this work, for commenting on earlier drafts of this paper, and for advising on the presentation and positioning of this paper. The author also wishes to thank Yurii Nesterov for encouraging the author's work on this topic, and for pointing out the application of IEP.

Appendix: Finite Radius Bound for SVM

Here we derive an upper bound on the norm of the optimal solution of the SVM problem (26).

Proposition 5.5. The optimal solution of the SVM problem (26) lies in the ball $B(0, R)$ for $R = \min\left\{ \frac{1}{n\lambda}\sum_{i=1}^{n}\|w_i\|, \ \sqrt{2/\lambda} \right\}$.

Proof: For convenience define $A_i := y_i w_i$ for $i = 1, \ldots, n$, and let A be the matrix whose i-th row is $A_i^T$. Then we can re-write the SVM problem as the following constrained optimization problem:

$$ \min_{s,\, x} \ \frac{1}{n}e^T s + \frac{\lambda}{2}\|x\|^2 \quad \text{s.t.} \quad s + Ax \ge e, \ \ s \ge 0 \ . $$

Let $\pi$ and $\beta$ be the multipliers on the two inequality constraints above. Then the KKT conditions imply, among other things, that the optimal solution $x^*$ must satisfy:

$$ \pi + \beta = \frac{1}{n}e \ , \qquad \lambda x^* = A^T \pi \ , $$

where $\pi \ge 0$ and $\beta \ge 0$. Define $\bar{\pi} = n\pi$. Then $0 \le \bar{\pi} \le e$ and

$$ \lambda\|x^*\| = \|A^T \pi\| = \frac{1}{n}\|A^T \bar{\pi}\| \ \le \ \frac{1}{n}\sum_{i=1}^{n}\|A_i\| = \frac{1}{n}\sum_{i=1}^{n}\|w_i\| \ , $$

which proves the first term in the definition of R. Also, we have $\frac{\lambda}{2}\|x^*\|^2 \le f(x^*) \le f(0) = 1$, thus $\|x^*\| \le \sqrt{2/\lambda}$. Therefore $\|x^*\| \le \min\left\{ \frac{1}{n\lambda}\sum_{i=1}^{n}\|w_i\|, \ \sqrt{2/\lambda} \right\}$, which finishes the proof.
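To make the IEP development above concrete, here is a minimal numerical sketch (assuming NumPy; the ellipsoid data, step-size, and iteration count are made up for illustration) of Deterministic Mirror Descent with the reference function of the form (30), where each update solves the linearization subproblem via the positive-root equation described in Section 5.1, here by simple bisection.

```python
import numpy as np

def mirror_step(c, a):
    """Minimize <c,x> + h(x) for h(x) = sum_i a_i/(i+2) ||x||^(i+2).
    The minimizer is x = -theta*c, where theta >= 0 is the unique root of
    sum_i a_i ||c||^i theta^(i+1) = 1 (found here by bisection)."""
    nc = np.linalg.norm(c)
    if nc == 0.0:
        return np.zeros_like(c)
    phi = lambda th: sum(ai * nc**i * th**(i + 1) for i, ai in enumerate(a))
    lo, hi = 0.0, 1.0
    while phi(hi) < 1.0:              # bracket the root (phi is increasing)
        hi *= 2.0
    for _ in range(60):               # bisection to high precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < 1.0 else (lo, mid)
    return -0.5 * (lo + hi) * c

# A made-up IEP instance: f_i(x) = 0.5 x'A_i x + b_i'x + c_i, all containing 0
rng = np.random.default_rng(1)
n, m = 5, 4
A = [np.eye(m) + 0.1 * (M @ M.T) for M in rng.normal(size=(n, m, m))]
b = [0.5 * v for v in rng.normal(size=(n, m))]
cs = [-1.0] * n                       # f_i(0) = -1 < 0, so f* <= -1

def f_and_subgrad(x):
    vals = [0.5 * x @ Ai @ x + bi @ x + ci for Ai, bi, ci in zip(A, b, cs)]
    j = int(np.argmax(vals))
    return vals[j], A[j] @ x + b[j]   # subgradient from an active piece

sigma = max(np.linalg.norm(Ai, 2) for Ai in A)                 # spectral norms
rho = 2.0 * max(np.linalg.norm(Ai @ bi) for Ai, bi in zip(A, b))
gamma = max(np.linalg.norm(bi) for bi in b)
coeffs = [gamma**2, rho, sigma**2]    # coefficients of p_2, giving h as in (30)

def grad_h(x):
    r = np.linalg.norm(x)
    return (sigma**2 * r**2 + rho * r + gamma**2) * x

x = rng.normal(size=m)                # arbitrary starting point x^0
t = 0.1                               # made-up constant step-size
best = np.inf
for _ in range(5000):
    val, g = f_and_subgrad(x)
    best = min(best, val)
    # x^{k+1} = argmin_x  t<g,x> + D_h(x, x^k)  =  argmin_x <t g - grad_h(x^k), x> + h(x)
    x = mirror_step(t * g - grad_h(x), coeffs)
print(best)   # best objective value found; a negative value means x entered Q
```

The step-size here is an arbitrary constant rather than the $\varepsilon$-dependent choice of Proposition 5.4; the point of the sketch is only the structure of the update, in which no Lipschitz constant for f is ever needed.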
References

[1] H.H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications, Mathematics of Operations Research 42 (2016), no. 2.

[2] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters 31 (2003), no. 3.

[3] D. Bertsekas, Nonlinear programming, Athena Scientific, Belmont, MA, 1999.

[4] S. Bubeck, Convex optimization: algorithms and complexity, Foundations and Trends in Machine Learning 8 (2015), no. 3-4.

[5] J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, Journal of Machine Learning Research 10 (2009), no. Dec.

[6] F. Hanzely and P. Richtarik, Randomized methods for minimizing relatively smooth functions, Tech. report, 2017.

[7] S. Lacoste-Julien, M. Schmidt, and F. Bach, A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, arXiv preprint arXiv:1212.2002 (2012).

[8] H. Lu, R. M. Freund, and Y. Nesterov, Relatively smooth convex optimization by first-order methods, and applications, SIAM Journal on Optimization 28 (2018), no. 1.

[9] A. Nedić and S. Lee, On stochastic subgradient mirror-descent algorithm with weighted averaging, SIAM Journal on Optimization 24 (2014), no. 1.

[10] A. S. Nemirovsky and D. B. Yudin, Problem complexity and method efficiency in optimization, Wiley, New York, 1983.

[11] Y. Nesterov, Introductory lectures on convex optimization: a basic course, Kluwer Academic Publishers, Boston, 2003.

[12] Y. Nesterov, private communication, 2016.

[13] S. Shalev-Shwartz, Y. Singer, and N. Srebro, Pegasos: primal estimated sub-gradient solver for SVM, Proceedings of the 24th International Conference on Machine Learning, ACM, 2007.

[14] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, Tech. report, May 2008.

[15] Q.
Van Nguyen, Forward-backward splitting with Bregman distances, Vietnam Journal of Mathematics 45 (2017), no. 3.

[16] J. Yu, S.V.N. Vishwanathan, S. Günter, and N.N. Schraudolph, A quasi-Newton approach to nonsmooth convex optimization problems in machine learning, Journal of Machine Learning Research 11 (2010), no. Mar.
[17] Y. Zhou, Y. Liang, and L. Shen, A unified approach to proximal algorithms using Bregman distance, Tech. report, 2016.
More informationSolving Dual Problems
Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem
More informationStatistical Machine Learning II Spring 2017, Learning Theory, Lecture 4
Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.
More informationA DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson
204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,
More informationThe Relation Between Pseudonormality and Quasiregularity in Constrained Optimization 1
October 2003 The Relation Between Pseudonormality and Quasiregularity in Constrained Optimization 1 by Asuman E. Ozdaglar and Dimitri P. Bertsekas 2 Abstract We consider optimization problems with equality,
More informationLecture 2: Subgradient Methods
Lecture 2: Subgradient Methods John C. Duchi Stanford University Park City Mathematics Institute 206 Abstract In this lecture, we discuss first order methods for the minimization of convex functions. We
More informationQuasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization
Quasi-Newton Methods Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization 10-725 Last time: primal-dual interior-point methods Given the problem min x f(x) subject to h(x)
More informationarxiv: v3 [math.oc] 17 Dec 2017 Received: date / Accepted: date
Noname manuscript No. (will be inserted by the editor) A Simple Convergence Analysis of Bregman Proximal Gradient Algorithm Yi Zhou Yingbin Liang Lixin Shen arxiv:1503.05601v3 [math.oc] 17 Dec 2017 Received:
More informationCoordinate Update Algorithm Short Course Subgradients and Subgradient Methods
Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n
More informationAccelerated Proximal Gradient Methods for Convex Optimization
Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationVariational Inequalities. Anna Nagurney Isenberg School of Management University of Massachusetts Amherst, MA 01003
Variational Inequalities Anna Nagurney Isenberg School of Management University of Massachusetts Amherst, MA 01003 c 2002 Background Equilibrium is a central concept in numerous disciplines including economics,
More informationLecture 3. Optimization Problems and Iterative Algorithms
Lecture 3 Optimization Problems and Iterative Algorithms January 13, 2016 This material was jointly developed with Angelia Nedić at UIUC for IE 598ns Outline Special Functions: Linear, Quadratic, Convex
More informationBISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption
BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption Daniel Reem (joint work with Simeon Reich and Alvaro De Pierro) Department of Mathematics, The Technion,
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationA projection algorithm for strictly monotone linear complementarity problems.
A projection algorithm for strictly monotone linear complementarity problems. Erik Zawadzki Department of Computer Science epz@cs.cmu.edu Geoffrey J. Gordon Machine Learning Department ggordon@cs.cmu.edu
More informationCharacterizing Robust Solution Sets of Convex Programs under Data Uncertainty
Characterizing Robust Solution Sets of Convex Programs under Data Uncertainty V. Jeyakumar, G. M. Lee and G. Li Communicated by Sándor Zoltán Németh Abstract This paper deals with convex optimization problems
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationThe Steepest Descent Algorithm for Unconstrained Optimization
The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem
More informationComputational Learning Theory - Hilary Term : Learning Real-valued Functions
Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling
More informationAn Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods
An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This
More informationOptimisation in Higher Dimensions
CHAPTER 6 Optimisation in Higher Dimensions Beyond optimisation in 1D, we will study two directions. First, the equivalent in nth dimension, x R n such that f(x ) f(x) for all x R n. Second, constrained
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationA Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints
A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints Hao Yu and Michael J. Neely Department of Electrical Engineering
More informationQuiz Discussion. IE417: Nonlinear Programming: Lecture 12. Motivation. Why do we care? Jeff Linderoth. 16th March 2006
Quiz Discussion IE417: Nonlinear Programming: Lecture 12 Jeff Linderoth Department of Industrial and Systems Engineering Lehigh University 16th March 2006 Motivation Why do we care? We are interested in
More informationEfficient Methods for Stochastic Composite Optimization
Efficient Methods for Stochastic Composite Optimization Guanghui Lan School of Industrial and Systems Engineering Georgia Institute of Technology, Atlanta, GA 3033-005 Email: glan@isye.gatech.edu June
More informationCHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008.
1 ECONOMICS 594: LECTURE NOTES CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS W. Erwin Diewert January 31, 2008. 1. Introduction Many economic problems have the following structure: (i) a linear function
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationGradient methods for minimizing composite functions
Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum
More information