Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent
Haihao Lu

August 3, 2018

Abstract

The usual approach to developing and analyzing first-order methods for non-smooth (stochastic or deterministic) convex optimization assumes that the objective function is uniformly Lipschitz continuous with parameter M_f. However, in many settings the non-differentiable convex function f is not uniformly Lipschitz continuous, for example: (i) the classical support vector machine (SVM) problem, (ii) the problem of minimizing the maximum of convex quadratic functions, and even (iii) the univariate setting with f(x) := max{0, x} + x². Herein we develop a notion of relative continuity that is determined relative to a user-specified reference function h (which should be computationally tractable for algorithms), and we show that many non-differentiable convex functions are relatively continuous with respect to a correspondingly fairly simple reference function h. We also similarly develop a notion of relative stochastic continuity for the stochastic setting. We analyze two standard algorithms, the (deterministic) mirror descent algorithm and the stochastic mirror descent algorithm, for solving optimization problems in these new settings, providing the first computational guarantees for instances where the objective function is not uniformly Lipschitz continuous. This paper is a companion paper for non-differentiable convex optimization to the recent paper by Lu, Freund, and Nesterov, which developed analogous results for differentiable convex optimization.

1 Introduction

The usual approach to developing and analyzing first-order methods for non-differentiable convex optimization (which we review shortly) assumes that the objective function is uniformly Lipschitz continuous, in both deterministic and stochastic settings. However, in many settings the objective function f is not uniformly Lipschitz continuous.
For example, consider the Support Vector Machine problem (SVM) for binary classification in machine learning, whose optimization formulation is:

SVM: min_x f(x) := (1/n)·Σ_{i=1}^n max{0, 1 − y_i·xᵀw_i} + λ·‖x‖₂²,

(MIT Department of Mathematics and MIT Operations Research Center, 77 Massachusetts Avenue, Cambridge, MA 02139 (haihao@mit.edu). The author's research is supported by AFOSR Grant No. FA and the MIT-Belgium Université Catholique de Louvain Fund.)
where w_i is the input feature vector of sample i and y_i ∈ {−1, 1} is the label of sample i. Notice that f is not differentiable due to the presence of the hinge loss terms in the summation, and f is also not Lipschitz continuous due to the presence of the squared ℓ2-norm regularization term; thus we cannot directly utilize typical subgradient or gradient schemes and their associated computational guarantees for SVM. Another example is the problem of computing a point x ∈ R^m in the intersection of n ellipsoids, which can be tackled via the optimization problem

IEP: f* = min_x f(x) := max_{1≤i≤n} { ½·xᵀA_ix + b_iᵀx + c_i },

where the i-th ellipsoid is Q_i = {x ∈ R^m : ½·xᵀA_ix + b_iᵀx + c_i ≤ 0} and A_i ∈ R^{m×m} is a symmetric positive semi-definite matrix, i = 1, ..., n. Observe that the objective function f is both non-differentiable and non-Lipschitz, and so it falls outside of the scope of standard classes of optimization problems for which first-order methods are guaranteed to work. Nevertheless, using the machinery developed in this paper, we will show in Section 5.3 how to solve both of these problems using deterministic or stochastic Mirror Descent.

In this paper we develop a general theory and algorithmic constructs that overcome the drawbacks in the usual analyses of first-order methods, which are grounded in restricted notions of uniform Lipschitz continuity. Here we develop a notion of relative continuity with respect to a given convex reference function h, a notion which does not require the specification of any particular norm; indeed, h need not be strongly (or even strictly) convex. Armed with relative continuity, we demonstrate the capability to solve a more general class of non-differentiable convex optimization problems (without uniform Lipschitz continuity) in both deterministic and stochastic settings. This paper is a companion for non-differentiable convex optimization to our predecessor paper [8] for differentiable convex optimization.
In [8], with a very similar philosophy, we developed the notions of relative smoothness and relative strong convexity with respect to a given convex reference function. In that paper we demonstrated the capability to solve a more general class of differentiable convex optimization problems (without uniformly Lipschitz continuous gradients), and we also demonstrated linear convergence results for a Primal Gradient Scheme when the objective function f is both smooth and strongly convex. There are some concurrent works on smooth optimization sharing a similar spirit to [8]. Bauschke, Bolte, and Teboulle [] present a similar definition of relative smoothness as in [8] and analyze the convergence of the Mirror Descent Algorithm, although their algorithm and convergence complexity depend on a symmetry measure of the Bregman distance. Zhou et al. [7] discuss a unified proof of Mirror Descent and the Proximal Point Algorithm under a similar assumption of relative smoothness. Nguyen [5] develops similar ideas for analyzing Mirror Descent in a Banach space. A more detailed discussion comparing these related works is also presented in [8]. More recently, Hanzely and Richtarik [6] develop stochastic algorithms for the relatively smooth optimization setting.

In Section 2 we review the traditional set-up for Mirror Descent in both the deterministic and stochastic settings. In Section 3 we introduce our notion of relative continuity in both the deterministic and stochastic settings, together with some relevant properties. In Section 4 we prove computational guarantees associated with the Mirror Descent and Stochastic Mirror Descent
algorithms under relative continuity. In Section 5 we show constructively how our ideas apply to a large class of non-differentiable and non-Lipschitz convex optimization problems that are not otherwise solvable by traditional first-order methods. Also in Section 5, we analyze computational guarantees associated with Mirror Descent and Stochastic Mirror Descent for the Intersection of Ellipsoids Problem (IEP) and also the Support Vector Machine (SVM) problem.

Notation. ‖·‖ denotes a given norm on R^n and ‖·‖_* denotes the usual dual norm on the dual space. ‖x‖₂ := √(xᵀx) denotes the Euclidean (inner product) norm, where xᵀ denotes the transpose of the vector x, and B_{‖·‖}(c, r) := {x ∈ R^n : ‖x − c‖ ≤ r}. ‖A‖₂ denotes the ℓ2 (spectral) norm of a matrix A. The inner product ⟨·,·⟩ specifically denotes the dot inner product in the underlying vector space. For a conditional random variable s̃(x) given x, E[s̃(x)|x] denotes the conditional expectation of s̃(x) given x.

2 Traditional Mirror Descent

The optimization problem of interest is:

P: min_x f(x) s.t. x ∈ Q, (1)

where Q ⊆ R^n is a closed convex set and f : Q → R is a convex function that is not necessarily differentiable. There are very many deterministic and stochastic first-order methods for tackling (1); see for example [4], [], [9] and the references therein. Virtually all such methods are designed to solve (1) when the objective function f satisfies a uniform Lipschitz continuity condition on Q, which in the deterministic setting is (essentially) equivalent to the condition that there exists a constant M_f < ∞ for which:

‖g(x)‖_* ≤ M_f for all x ∈ Q and g(x) ∈ ∂f(x), (2)

where ∂f(x) is the subdifferential of f at x (i.e., the collection of subgradients of f at x), ‖·‖ is a given norm on R^n, and ‖·‖_* denotes the usual dual norm. Here we use g(x) to denote an assignment of a subgradient (or an oracle call thereof) at x, and so g(x) is not a function nor is it a point-to-set map.
Another useful functional notion is strong convexity: f is (uniformly) µ_f-strongly convex for some µ_f > 0 if

f(y) ≥ f(x) + ⟨g(x), y − x⟩ + (µ_f/2)·‖y − x‖² for all x, y ∈ Q and g(x) ∈ ∂f(x). (3)

Nedić and Lee [9] obtain improved convergence guarantees for the stochastic mirror descent algorithm under strong convexity, for example.

2.1 Deterministic Setting

Let us now recall the Mirror Descent Algorithm (see [0] and []), which is also referred to as the prox subgradient method when interpreted in the space of primal variables. Mirror Descent employs a
differentiable convex prox function h to define a Bregman distance:

D_h(y, x) := h(y) − h(x) − ⟨∇h(x), y − x⟩ for all x, y ∈ Q. (4)

The Bregman distance is used in the computation of the Mirror Descent update:

x^{i+1} ← arg min_{x∈Q} { f(x^i) + ⟨g(x^i), x − x^i⟩ + (1/t_i)·D_h(x, x^i) },

where {t_i} is the sequence of step-sizes for the scheme. A formal statement of the Mirror Descent Algorithm is presented in Algorithm 1. The traditional set-up requires that h is 1-strongly convex with respect to the given norm, and in this set-up one can prove that after k iterations it holds for any x ∈ Q that:

min_{0≤i≤k} f(x^i) − f(x) ≤ ( (M_f²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ) / ( Σ_{i=0}^k t_i ), (5)

which leads to an O(1/√k) sublinear rate of convergence using an appropriately chosen step-size sequence {t_i}, see [].

Algorithm 1: Deterministic Mirror Descent Algorithm with Bregman distance D_h(·,·)

Initialize. Initialize with x⁰ ∈ Q. Let h be a given convex function.

At iteration i: Perform Updates. Compute g(x^i) ∈ ∂f(x^i), determine step-size t_i, and compute the update:

x^{i+1} ← arg min_{x∈Q} { f(x^i) + ⟨g(x^i), x − x^i⟩ + (1/t_i)·D_h(x, x^i) }. (6)

Notice in (6) by construction that the update requires the capability to solve instances of a linearized subproblem (which we denote by LS) of the general form:

LS: x_new ← arg min_{x∈Q} { ⟨c, x⟩ + h(x) }, (7)

for suitable iteration-specific values of c. Indeed, (6) is an instance of (7) with c = t_i·g(x^i) − ∇h(x^i) at iteration i. It is especially important to note that the Mirror Descent update (6) is somewhat meaningless absent the capability to efficiently solve (7), a fact which we return to later. In the usual design and implementation of Mirror Descent for solving (1), one attempts to specify the norm ‖·‖ and the 1-strongly convex prox function h in consideration of properties of the feasible domain Q, while also ensuring that the LS subproblem (7) is efficiently solvable.
Notice that the Mirror Descent Algorithm (Algorithm 1) itself does not require the traditional set-up in which h is 1-strongly convex for some particular norm; rather, this requirement is part of the traditional analysis. As we will see, we can instead analyze Mirror Descent by considering the intrinsic ways that f and D_h(·,·) are related functionally, in a manner that is constructive in terms of actual algorithm design and implementation. Furthermore, this is in the same spirit as the predecessor paper [8].
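To make the Mirror Descent update concrete, the following minimal sketch (our own illustration in Python, not code from the paper) instantiates Algorithm 1 in the unconstrained Euclidean case h(x) = ½‖x‖₂² and Q = R^n, where D_h(y, x) = ½‖y − x‖₂² and the update (6) reduces to a plain subgradient step x^{i+1} = x^i − t_i·g(x^i):

```python
import numpy as np

def mirror_descent_euclidean(subgrad, x0, steps):
    """Algorithm 1 with h(x) = (1/2)||x||^2 and Q = R^n, so that the
    update (6) reduces to a plain subgradient step x <- x - t * g(x)."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for t in steps:
        g = subgrad(x)     # any subgradient g(x) of f at x
        x = x - t * g      # minimizer of <g, x> + (1/t) * (1/2)||x - x_i||^2
        iterates.append(x.copy())
    return iterates

# Example: f(x) = |x|, subgradient sign(x), started at x0 = 1.0.
iters = mirror_descent_euclidean(np.sign, np.array([1.0]), [0.1] * 50)
f_best = min(abs(float(z[0])) for z in iters)
```

Since this h is 1-strongly convex in the Euclidean norm, the classical bound (5) applies to this example with a constant step-size.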
2.2 Stochastic Setting

For some convex functions, computing an exact subgradient at x ∈ Q may be expensive or even intractable, but sampling a random stochastic estimate of a subgradient at x, which we denote by g̃(x), may be easy. We say that g̃(x) is an unbiased stochastic subgradient if E[g̃(x)|x] ∈ ∂f(x). The usefulness of a stochastic subgradient methodology is easily seen in the context of machine and statistical learning problems. A prototypical learning problem is to compute an approximate solution of the following empirical loss minimization problem:

min_x f(x) := (1/n)·Σ_{j=1}^n f_j(x) s.t. x ∈ Q,

where f_j is a non-differentiable convex loss function associated with sample j, for j = 1, ..., n data samples. When n ≫ 0, the standard subgradient method needs to evaluate n subgradients in order to compute one subgradient of f, which can be prohibitively expensive. A typical alternative is to compute a stochastic subgradient. Letting x^i denote the i-th iterate, at iteration i a single sample index j is drawn uniformly and independently from {1, ..., n}, and then a subgradient g ∈ ∂f_j(x^i) is computed and used to define g̃(x^i) := g. This stochastic subgradient is then used in place of a subgradient at iteration i. Notice that by construction g̃(x^i) is a conditional random variable given x^i, and g̃(x^i) is an unbiased stochastic subgradient, namely E[g̃(x^i)|x^i] ∈ ∂f(x^i). A stochastic version of Mirror Descent is presented in Algorithm 2. The structure of Stochastic Mirror Descent is identical to that of Mirror Descent, the only difference being that the stochastic estimate g̃(x^i) replaces the exact subgradient g(x^i) of Algorithm 1.

Algorithm 2: Stochastic Mirror Descent Algorithm with Bregman distance D_h(·,·)

Initialize. Initialize with x⁰ ∈ Q. Let h be a given convex differentiable function.

At iteration i: Perform Updates.
Compute a stochastic subgradient g̃(x^i), determine step-size t_i, and compute the update:

x^{i+1} ← arg min_{x∈Q} { f(x^i) + ⟨g̃(x^i), x − x^i⟩ + (1/t_i)·D_h(x, x^i) }. (8)

A standard condition that is required in the traditional convergence analysis for Stochastic Mirror Descent (as well as other stochastic first-order methods) is that there exists G_f > 0 for which:

E[ ‖g̃(x)‖_*² | x ] ≤ G_f², for any x ∈ Q. (9)

For notational convenience, we say that f is G_f-stochastically continuous if (9) holds. In [9], Nedić and Lee developed convergence results for Stochastic Mirror Descent (Algorithm 2). Under the conditions that (i) f is G_f-stochastically continuous, (ii) h is a differentiable and µ_h-strongly convex function on Q, and (iii) Q is a closed and bounded set, Nedić and Lee ([9] equation (7)) show the following convergence result using step-sizes t_i = √(µ_h·D_max)/(G_f·√(i+1)):

E[ f(x̄^k) ] − f* ≤ (3/2)·G_f·√( D_max / (µ_h·(k+1)) ), (10)
where x̄^k := (1/(Σ_{i=0}^k t_i))·Σ_{i=0}^k t_i·x^i and D_max := max_{x,y∈Q} D_h(x, y). Furthermore, if also (a) f is µ_f-strongly convex, and (b) h is L_h-smooth, Nedić and Lee ([9]) show that with step-sizes t_i = 2·L_h/(µ_f·(i+1)) it holds that:

E[ f(x̌^k) ] − f* ≤ 2·G_f²·L_h / (µ_f·µ_h·(k+1)), (11)

where x̌^k := (2/((k+1)(k+2)))·Σ_{i=0}^k (i+1)·x^i.

3 Relative Continuity

In this section we introduce our definition of relative continuity of a function f; there are actually two different definitions, one for the deterministic setting and another for the stochastic setting. The starting point is a reference function h, a given differentiable convex function on Q that is used to construct the usual Bregman distance D_h(·,·) (4), and that is used as part of the Mirror Descent update (6). However, we point out for emphasis that, unlike the traditional set-up, there are no assumptions on h (such as strong or strict convexity).

3.1 Deterministic Setting

Consider the objective function f of (1). We define relative continuity of f relative to the reference function h using the Bregman distance D_h(·,·) of h as follows.

Definition 3.1. f is M-relative continuous with respect to the reference function h on Q if for any x, y ∈ Q, x ≠ y, and g(x) ∈ ∂f(x), it holds that

‖g(x)‖_* ≤ M·√(2·D_h(y, x)) / ‖y − x‖. (12)

(In the particular case when h(x) = ½‖x‖₂², the Bregman distance is D_h(y, x) = ½‖y − x‖₂², and the relative continuity condition (12) becomes ‖g(x)‖₂ ≤ M, which corresponds to the standard definition of Lipschitz continuity (2) for the ℓ2-norm.) We can rewrite (12) as

‖g(x)‖_*² ≤ M²·( 2·D_h(y, x) / ‖y − x‖² ), (13)

which states that the square of the norm of any subgradient is bounded by M² times the ratio of the Bregman distance D_h(y, x) to ½‖y − x‖². The following proposition presents the key property of an M-relative continuous function that is used in the proofs of the results to follow.
Proposition 3.1. (Key property of M-relative continuity) If f is M-relative continuous with respect to the reference function h, then for any t > 0 it holds for all x, y ∈ Q and g(x) ∈ ∂f(x) that:

(1/t)·D_h(y, x) + ⟨g(x), y − x⟩ + (t/2)·M² ≥ 0. (14)

Proof: If f is M-relative continuous with respect to h, then for any t > 0 it follows that

⟨g(x), x − y⟩ ≤ ‖g(x)‖_*·‖y − x‖ ≤ M·√(2·D_h(y, x)) ≤ (t/2)·M² + (1/t)·D_h(y, x),

where the last inequality is an application of the arithmetic-geometric mean inequality. The proof follows by rearranging terms.

The key property (14) is what is used in the proofs of the results herein, so we could define M-relative continuity using (14) instead of (12). Furthermore, (14) is independent of any norm structure, and so is attractive for its generality. However, we use the definition (12) because it leads to easy verification of M-relative continuity in practical instances, as is shown in Section 5. The following proposition presents some scaling and additivity properties of relative continuity.

Proposition 3.2. (Additivity and scaling of relative continuity)

1. If f is M-relative continuous with respect to h, then for any α > 0, f is (M/√α)-relative continuous with respect to α·h.

2. If f is M-relative continuous with respect to h, then for any α > 0, α·f is M-relative continuous with respect to α²·h.

3. If f_j is M-relative continuous with respect to h_j for j = 1, ..., n, then Σ_{j=1}^n f_j is (√n·M)-relative continuous with respect to Σ_{j=1}^n h_j.

4. If f_j is M_j-relative continuous with respect to h_j for j = 1, ..., n, then for any α_j > 0 and M > 0 it holds that Σ_{j=1}^n α_j·f_j is (√n·M)-relative continuous with respect to Σ_{j=1}^n (α_j²/β_j²)·h_j, with β_j := M/M_j.

Proof:

1. Let x, y ∈ Q, x ≠ y, and g(x) ∈ ∂f(x). It holds that

‖g(x)‖_* ≤ M·√(2·D_h(y, x))/‖y − x‖ = (M/√α)·√(2·D_{α·h}(y, x))/‖y − x‖,

which establishes the result.

2. Notice that g(x) is a subgradient of f at x if and only if α·g(x) is a subgradient of α·f at x, whereby

‖α·g(x)‖_* = α·‖g(x)‖_* ≤ α·M·√(2·D_h(y, x))/‖y − x‖ = M·√(2·D_{α²·h}(y, x))/‖y − x‖,

which establishes the result.

3.
Any subgradient of Σ_{j=1}^n f_j at x can be written as Σ_{j=1}^n g_j(x), where g_j(x) ∈ ∂f_j(x) for j = 1, ..., n (see Theorem B. of [3]). From the triangle inequality and the relative continuity of f_j we have:

‖Σ_{j=1}^n g_j(x)‖_* ≤ Σ_{j=1}^n ‖g_j(x)‖_* ≤ M·Σ_{j=1}^n √(2·D_{h_j}(y, x))/‖y − x‖ ≤ √n·M·√(2·Σ_{j=1}^n D_{h_j}(y, x))/‖y − x‖ = √n·M·√(2·D_{h_1+···+h_n}(y, x))/‖y − x‖,

where the third inequality is an application of the ℓ1/ℓ2-norm inequality applied to the n-tuple (√(D_{h_1}(y, x)), ..., √(D_{h_n}(y, x))).

4. It follows from part (2.) that α_j·f_j is M_j-continuous relative to α_j²·h_j. Thus α_j·f_j is also M-continuous relative to (α_j²/β_j²)·h_j from part (1.), whereby the proof is finished by utilizing part (3.).

We also make use of the notion of relative strong convexity, which was introduced in [8] and is used here in some of the convergence guarantee analyses.

Definition 3.2. f is µ-strongly convex relative to h on Q for a scalar µ ≥ 0 if for any x, y ∈ int Q and any g(x) ∈ ∂f(x) it holds that

f(y) ≥ f(x) + ⟨g(x), y − x⟩ + µ·D_h(y, x). (15)

In [8] it was shown that the notion of relative strong convexity embodied in Definition 3.2 is the natural way to define strong convexity in the context of mirror descent and similar algorithms, and it leads to linear convergence of mirror descent in the smooth setting. In the non-smooth setting, we will show in Theorem 4.2 that relative strong convexity improves the convergence of mirror descent from O(1/√k) to O(1/k).

3.2 Stochastic Setting

For x ∈ Q, let g̃(x) denote a random stochastic estimate of a subgradient of f at x. Extending the definition of relative continuity from the deterministic setting, we define stochastic relative continuity as follows.

Definition 3.3. f is G-stochastically-relative continuous with respect to the reference function h on Q for some G > 0 if f together with the oracle to compute a stochastic subgradient satisfies:

1. Unbiasedness property: E[g̃(x)|x] ∈ ∂f(x), and

2. Boundedness property: E[‖g̃(x)‖_*²|x] ≤ G²·( 2·D_h(y, x)/‖y − x‖² ) for all x, y ∈ Q with x ≠ y.
(In the particular case when h(x) = ½‖x‖₂², the Bregman distance is D_h(y, x) = ½‖y − x‖₂², whereby the boundedness property of stochastic relative continuity becomes E[‖g̃(x)‖₂²|x] ≤ G² for all x ∈ Q, which corresponds to the standard condition (9) for the ℓ2-norm.)
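As an illustration of the unbiasedness property in the empirical-loss setting of Section 2.2, the sketch below (our own example; the data array a and the losses f_j(x) = |x − a_j| are hypothetical) draws a single index uniformly and returns a subgradient of f_j; averaging many such draws approaches the full subgradient (1/n)·Σ_j sign(x − a_j):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([-1.0, 0.5, 2.0, 3.0])   # hypothetical sample data

def stochastic_subgrad(x):
    """Single-sample stochastic subgradient for f(x) = (1/n) sum_j |x - a_j|:
    draw j uniformly from {1, ..., n} and return sign(x - a_j) in df_j(x)."""
    j = rng.integers(len(a))
    return np.sign(x - a[j])

# Empirical check of unbiasedness, E[g~(x) | x] in df(x): the sample mean of
# the stochastic subgradients should approach (1/n) * sum_j sign(x - a_j).
x = 1.0
full_subgrad = np.mean(np.sign(x - a))
estimate = np.mean([stochastic_subgrad(x) for _ in range(20000)])
```

The sample mean concentrates around the full subgradient at the usual O(1/√N) Monte Carlo rate.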
For x ∈ Q, define

M(x) := ‖g̃(x)‖_* · max_{y∈Q, y≠x} ‖y − x‖/√(2·D_h(y, x)). (16)

Notice for a given x that max_{y∈Q, y≠x} ‖y − x‖/√(2·D_h(y, x)) is a deterministic quantity, and therefore M(x) is a conditional random variable (given x) that is defined on the same probability space as g̃(x). Clearly, if f is G-stochastically-relative continuous, we have by the boundedness property that for any x ∈ Q:

E[M(x)²|x] = E[‖g̃(x)‖_*²|x] · max_{y∈Q, y≠x} ‖y − x‖²/(2·D_h(y, x)) ≤ G². (17)

Exactly as in the deterministic setting, we have:

Proposition 3.3. If f is G-stochastically-relative continuous with respect to the reference function h on Q, then for any t > 0 it holds for all x, y ∈ Q and any stochastic subgradient estimate g̃(x) that:

(1/t)·D_h(y, x) + ⟨g̃(x), y − x⟩ + (t/2)·M(x)² ≥ 0.

Proof: For any t > 0, we have

⟨g̃(x), x − y⟩ ≤ ‖g̃(x)‖_*·‖y − x‖ ≤ M(x)·√(2·D_h(y, x)) ≤ (t/2)·M(x)² + (1/t)·D_h(y, x),

and the proof is finished by rearranging terms.

4 Computational Analysis for Stochastic Mirror Descent and (Deterministic) Mirror Descent

In this section we present computational guarantees for Stochastic Mirror Descent (Algorithm 2) for minimizing a convex function f that is G-stochastically-relative continuous with respect to a given reference function h. We also present computational guarantees for (deterministic) Mirror Descent (Algorithm 1) when f is M-relative continuous with respect to a reference function h, which follows as a special case of the stochastic setting. We begin by recalling the standard Three-Point Property for optimization using Bregman distances:

Lemma 4.1. (Three-Point Property (Tseng [4])) Let φ(x) be a convex function, and let D_h(·,·) be the Bregman distance for h. For a given vector z, let

z⁺ := arg min_{x∈Q} { φ(x) + D_h(x, z) }.

Then φ(x) + D_h(x, z) ≥ φ(z⁺) + D_h(z⁺, z) + D_h(x, z⁺) for all x ∈ Q.
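The Three-Point Property can be verified numerically in a simple setting (our own check, not from the paper): take h(x) = ½‖x‖₂², Q = R^n, and a linear φ(x) = ⟨c, x⟩, for which z⁺ = z − c in closed form; for this quadratic h the inequality of Lemma 4.1 in fact holds with equality:

```python
import numpy as np

rng = np.random.default_rng(1)

def D(y, x):
    """Bregman distance of h(x) = (1/2)||x||^2: D_h(y, x) = (1/2)||y - x||^2."""
    return 0.5 * float(np.dot(y - x, y - x))

c = rng.standard_normal(5)
z = rng.standard_normal(5)
z_plus = z - c            # argmin_x { <c, x> + D(x, z) } in closed form

x = rng.standard_normal(5)
lhs = float(np.dot(c, x)) + D(x, z)
rhs = float(np.dot(c, z_plus)) + D(z_plus, z) + D(x, z_plus)
gap = lhs - rhs           # >= 0 by the Three-Point Property
```

For a general convex φ, or a non-quadratic h, the gap is nonnegative but typically strictly positive; equality here is special to the quadratic h with linear φ.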
Let us denote the (primitive) random variable at the i-th iteration of the Stochastic Mirror Descent Algorithm (Algorithm 2) by γ_i; i.e., γ_i is the random variable that determines the stochastic subgradient g̃(x^i) at the iterate x^i in the Stochastic Mirror Descent Algorithm. Then x^{i+1} is computed according to the update of the Stochastic Mirror Descent Algorithm, whereby x^{i+1} is a random variable that depends on all previous values γ_0, ..., γ_i; we denote this string of random variables by ξ_i := {γ_0, ..., γ_i}. The following theorem states convergence guarantees for the Stochastic Mirror Descent Algorithm in terms of expectation.

Theorem 4.1. (Convergence Bound for Stochastic Mirror Descent Algorithm) Consider the Stochastic Mirror Descent Algorithm (Algorithm 2) with given step-size sequence {t_i}. If f is G-stochastically-relative continuous with respect to h for some G > 0, then the following inequality holds for all k and x ∈ Q:

E_{ξ_k}[ f(x̄^k) ] − f(x) ≤ ( (G²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ) / ( Σ_{i=0}^k t_i ), (18)

where x̄^k := (1/(Σ_{i=0}^k t_i))·Σ_{i=0}^k t_i·x^i.

Proof: First notice that

f(x^i) + ⟨g(x^i), x − x^i⟩
= f(x^i) + ⟨E_{γ_i}[g̃(x^i)|x^i], x − x^i⟩
= f(x^i) + E_{γ_i}[ ⟨g̃(x^i), x − x^i⟩ | x^i ]
≥ f(x^i) + E_{γ_i}[ ⟨g̃(x^i), x^{i+1} − x^i⟩ + (1/t_i)·D_h(x^{i+1}, x^i) + (1/t_i)·D_h(x, x^{i+1}) − (1/t_i)·D_h(x, x^i) | x^i ]
≥ f(x^i) + E_{γ_i}[ −(t_i/2)·M(x^i)² + (1/t_i)·D_h(x, x^{i+1}) − (1/t_i)·D_h(x, x^i) | x^i ]
≥ f(x^i) − (t_i/2)·G² + (1/t_i)·E_{γ_i}[D_h(x, x^{i+1})|x^i] − (1/t_i)·D_h(x, x^i), (19)

where the first equality uses the unbiasedness of g̃(x^i) (with g(x^i) := E_{γ_i}[g̃(x^i)|x^i] ∈ ∂f(x^i)), the second equality is because of linearity, the first inequality is from the Three-Point Property with φ(x) = t_i·⟨g̃(x^i), x − x^i⟩, the second inequality uses Proposition 3.3 (with y = x^{i+1}), and the last inequality uses (17). Since also f(x) ≥ f(x^i) + ⟨g(x^i), x − x^i⟩ from the definition of a subgradient, we have from (19):

f(x) ≥ f(x^i) − (t_i/2)·G² + (1/t_i)·E_{γ_i}[D_h(x, x^{i+1})|x^i] − (1/t_i)·D_h(x, x^i).
Taking expectations with respect to ξ_i on both sides of the above inequality yields:

f(x) ≥ E_{ξ_i}[f(x^i)] − (t_i/2)·G² + (1/t_i)·E_{ξ_i}[D_h(x, x^{i+1})] − (1/t_i)·E_{ξ_i}[D_h(x, x^i)]. (20)

Now rearrange and multiply through by t_i to yield:

t_i·E_{ξ_i}[f(x^i) − f(x)] ≤ (t_i²/2)·G² + E_{ξ_i}[D_h(x, x^i)] − E_{ξ_i}[D_h(x, x^{i+1})]. (21)
Summing the above inequality over i = 0, ..., k and noting that D_h(x, x^{k+1}) ≥ 0, we arrive at:

(G²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ≥ Σ_{i=0}^k t_i·E_{ξ_i}[f(x^i) − f(x)] = E_{ξ_k}[ Σ_{i=0}^k t_i·(f(x^i) − f(x)) ] ≥ ( Σ_{i=0}^k t_i )·E_{ξ_k}[ f(x̄^k) − f(x) ],

where the last inequality uses the convexity of f. Dividing by Σ_{i=0}^k t_i completes the proof.

Remark 4.1. As a direct consequence of the proof of Theorem 4.1, we obtain the following result, which is similar to the deterministic setting (5):

E_{ξ_k}[ min_{0≤i≤k} f(x^i) ] − f(x) ≤ ( (G²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ) / ( Σ_{i=0}^k t_i ).

Theorem 4.1 implies the following high-probability result using a simple Markov bound.

Corollary 4.1. Let x* be an optimal solution of (1). Under the conditions of Theorem 4.1, for any δ > 0 it holds that:

P[ f(x̄^k) − f* ≥ δ ] ≤ ( (G²/2)·Σ_{i=0}^k t_i² + D_h(x*, x⁰) ) / ( δ·Σ_{i=0}^k t_i ).

Proof: Using the Markov inequality, we have:

P[ f(x̄^k) − f* ≥ δ ] ≤ E[ f(x̄^k) − f* ]/δ ≤ ( (G²/2)·Σ_{i=0}^k t_i² + D_h(x*, x⁰) ) / ( δ·Σ_{i=0}^k t_i ).

Similar to the case of the traditional analysis of stochastic mirror descent, the Stochastic Mirror Descent Algorithm (Algorithm 2) leads to an O(1/ε²) convergence guarantee (in expectation) by using an appropriate step-size sequence {t_i}, as the next corollary shows.

Corollary 4.2. Under the conditions of Theorem 4.1, for a given ε > 0 suppose that the step-sizes are set to t_i := ε/G² for all i. Then within

k := ⌈ 2·G²·D_h(x*, x⁰) / ε² ⌉

iterations of the Stochastic Mirror Descent Algorithm it holds that:

E[ f(x̄^k) ] − f* ≤ ε,

where x* is any optimal solution of (1).

Proof: Substituting the values of t_i in (18) yields the result directly.
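The constant step-size rule of Corollary 4.2 is easy to compute; the helper below is our own sketch, assuming the constants t_i = ε/G² and k = ⌈2·G²·D_h(x*, x⁰)/ε²⌉ (the values of G, D, and ε are illustrative only, not from the paper):

```python
import math

def smd_schedule(G, D, eps):
    """Constant step-size and iteration count for an eps-optimality gap
    (in expectation), where D stands for D_h(x*, x^0)."""
    t = eps / G**2                          # constant step-size t_i
    k = math.ceil(2 * G**2 * D / eps**2)    # iterations for E[gap] <= eps
    return t, k

t, k = smd_schedule(G=2.0, D=1.0, eps=0.1)
```

Note the ε⁻² scaling of the iteration count, matching the O(1/ε²) guarantee.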
Remark 4.2. Similar to the standard stochastic gradient descent scheme, the step-size rule in Corollary 4.2 leads to the optimal rate of convergence provided by the bound in Theorem 4.1.

Remark 4.3. Let us now compare these results to the related results of Nedić and Lee [9]. In order to attain an ε-optimality gap, [9] proved a bound of 9·G_f²·D_max/(4·µ_h·ε²) iterations, which follows by rearranging (10). In addition to not requiring Lipschitz continuity of f, our bound does not require that h be strongly convex. We also do not require boundedness of the feasible region; and in most settings D_h(x*, x⁰) ≪ D_max even when D_max < +∞. Furthermore, even in the setting of (10), it holds that:

G² = E[‖g̃(x)‖²|x] · max_{y∈Q, y≠x} ‖y − x‖²/(2·D_h(y, x)) ≤ E[‖g̃(x)‖²|x]/µ_h ≤ G_f²/µ_h,

where the first inequality utilizes the strong convexity (in the standard sense) of h, and the second inequality is due to the assumption that f is G_f-stochastically continuous (in the standard sense). Thus we see that the bound in Corollary 4.2 improves on the bound in [9].

In the case when f is also µ-strongly convex relative to h (see Definition 3.2), we obtain an O(1/k) convergence guarantee in expectation, which is also similar to the traditional case of stochastic gradient descent. This is shown in the next result.

Theorem 4.2. (Convergence Bound for Stochastic Mirror Descent Algorithm under Strong Convexity relative to h) Consider the Stochastic Mirror Descent Algorithm (Algorithm 2). If f is G-stochastically-relative continuous with respect to h for some G > 0, f is µ-strongly convex relative to h for some µ > 0, and the step-sizes are chosen as t_i = 2/(µ·(i+1)), then the following inequality holds for all k ≥ 1:

E_{ξ_k}[ f(x̂^k) ] − f* ≤ 2·G²/(µ·(k+1)),

where x̂^k := (2/(k·(k+1)))·Σ_{i=0}^k i·x^i.

Proof: For any x ∈ Q it follows from the definition of µ-strong convexity relative to h (15) that

f(x) ≥ f(x^i) + ⟨g(x^i), x − x^i⟩ + µ·D_h(x, x^i).
Combining the above inequality with (19) (the chain of inequalities in the proof of Theorem 4.1) yields

f(x) ≥ f(x^i) − (t_i/2)·G² + (1/t_i)·E_{γ_i}[D_h(x, x^{i+1})|x^i] − ((1/t_i) − µ)·D_h(x, x^i).

Substituting t_i = 2/(µ·(i+1)) and multiplying by i in the above inequality yields:

i·( f(x^i) − f(x) ) ≤ G²·i/(µ·(i+1)) + (µ/2)·( i(i−1)·D_h(x, x^i) − i(i+1)·E_{γ_i}[D_h(x, x^{i+1})|x^i] )
≤ G²/µ + (µ/2)·( i(i−1)·D_h(x, x^i) − i(i+1)·E_{γ_i}[D_h(x, x^{i+1})|x^i] ).

Taking expectations over ξ_i and summing the above inequality over i = 0, ..., k then yields

( Σ_{i=0}^k i )·E_{ξ_k}[ f(x̂^k) − f(x) ] ≤ Σ_{i=0}^k i·( E_{ξ_i}[f(x^i)] − f(x) ) ≤ k·G²/µ − (µ/2)·k(k+1)·E_{ξ_k}[D_h(x, x^{k+1})] ≤ k·G²/µ,
where the first inequality uses the convexity of f and the observation that E_{ξ_i}[f(x^i)] = E_{ξ_k}[f(x^i)] for i ≤ k. Taking x = x*, where x* is an optimal solution of (1), the proof is completed by noticing that Σ_{i=0}^k i = k(k+1)/2.

Remark 4.4. It may not be easy to find cases when the objective function is both G-stochastically-relative continuous and µ-strongly convex relative to the same reference function h. However, as long as these properties are satisfied along the path of iterates or around the minimum, one can achieve the faster convergence of Theorem 4.2.

Remark 4.5. Let us also compare the computational guarantee of Theorem 4.2 to the results in Nedić and Lee [9]. In order to attain an ε-optimality gap, [9] proved the bound (11). First notice that we require neither that f is uniformly Lipschitz continuous, nor that h is strongly convex in the traditional sense, nor that h is uniformly smooth. However, even if these requirements hold, it follows from Remark 4.3 that G² ≤ G_f²/µ_h, and it also holds that:

D_f(x, y) ≥ (µ_f/2)·‖x − y‖² ≥ (µ_f/L_h)·D_h(x, y),

where the first inequality utilizes that f is µ_f-strongly convex and the second that h is L_h-smooth, both in the standard sense. Thus f is at least µ = (µ_f/L_h)-strongly convex relative to h (this follows from [8]). Therefore, even under the stronger requirements of [9], Theorem 4.2 improves on the corresponding result in [9].

We end this section with a discussion of the deterministic setting, namely the (Deterministic) Mirror Descent Algorithm (Algorithm 1). Suppose that there is no stochasticity in the computation of subgradients. We can cast this as an instance of the Stochastic Mirror Descent Algorithm (Algorithm 2) wherein g̃(x) = g(x) ∈ ∂f(x) for all x ∈ Q. In this case relative stochastic continuity (Definition 3.3) is equivalent to relative continuity (Definition 3.1) with the same constant.
Thus deterministic Mirror Descent is a special case of Stochastic Mirror Descent, and we have the following computational guarantees as special cases of the stochastic case.

Theorem 4.3. (Convergence Bound for Deterministic Mirror Descent Algorithm) Consider the (Deterministic) Mirror Descent Algorithm (Algorithm 1). If f is M-relative continuous with respect to h for some M > 0, then for all k and x ∈ Q the following inequality holds:

f(x̄^k) − f(x) ≤ ( (M²/2)·Σ_{i=0}^k t_i² + D_h(x, x⁰) ) / ( Σ_{i=0}^k t_i ),

where x̄^k := (1/(Σ_{i=0}^k t_i))·Σ_{i=0}^k t_i·x^i.

Corollary 4.3. Under the conditions of Theorem 4.3, for a given ε > 0 suppose that the step-sizes are set to t_i := ε/M² for all i. Then within

k := ⌈ 2·M²·D_h(x*, x⁰) / ε² ⌉
iterations of Deterministic Mirror Descent it holds that:

f(x̄^k) − f* ≤ ε,

where x̄^k := (1/(Σ_{i=0}^k t_i))·Σ_{i=0}^k t_i·x^i, and x* is any optimal solution of (1).

Theorem 4.4. (Convergence Bounds for Deterministic Mirror Descent with Strong Relative Convexity) Consider the Deterministic Mirror Descent Algorithm (Algorithm 1). If f is M-relative continuous with respect to h for some M > 0, f is µ-strongly convex relative to h for some µ > 0, and the step-sizes are chosen as t_i = 2/(µ·(i+1)), then the following inequality holds for all k ≥ 1:

f(x̂^k) − f* ≤ 2·M²/(µ·(k+1)),

where x̂^k := (2/(k·(k+1)))·Σ_{i=0}^k i·x^i.

5 Specifying a Reference Function h with Relative Continuity for Mirror Descent

Let us discuss using either deterministic or stochastic Mirror Descent (Algorithm 1 or Algorithm 2) for solving the optimization problem (1) with an objective function f that is M-relative continuous or G-stochastically-relative continuous (respectively) with respect to the reference function h. In order to efficiently execute the update step in Algorithm 1 and/or Algorithm 2, we need h to be such that the linearization subproblem LS (7) is efficiently solvable for any given c. Therefore, in order to execute Mirror Descent for solving (1) using Algorithm 1 or Algorithm 2, we need to specify a differentiable convex reference function h that has the following two properties: (i) f is M-relative continuous (or G-stochastically-relative continuous) with respect to h on Q, for an M (or G) that is easy to determine, and (ii) the linearization subproblem LS (7) has a solution, and the solution is efficiently computable. We now discuss quite broadly how to construct such a reference function h with these two properties when ‖g(x)‖₂ is bounded by a polynomial in ‖x‖₂.

5.1 Deterministic Setting

Suppose that ‖g(x)‖₂ ≤ p_r(‖x‖₂) for all x ∈ Q and all g(x) ∈ ∂f(x), where p_r(α) = Σ_{i=0}^r a_i·α^i is an r-degree polynomial in α whose coefficients {a_i} are nonnegative. Since the coefficients {a_i} are nonnegative, the squared polynomial p_r(α)² also has nonnegative coefficients, which we write as p_r(α)² = Σ_{i=0}^{2r} â_i·α^i. Let

h(x) := Σ_{i=0}^{2r} ( â_i/(i+2) )·‖x‖₂^{i+2}.
Then the following proposition states that f is 1-relatively continuous with respect to h. This implies that no matter how fast the subgradients of f grow polynomially as $\|x\| \to \infty$, f is relatively continuous with respect to the simple reference function h, even though f does not exhibit uniform Lipschitz continuity.

Proposition 5.1. f is 1-continuous relative to $h(x) = \sum_{i=0}^{r} \frac{a_i}{i+2}\|x\|^{i+2}$.

Proof: Let $h_i(x) = \frac{1}{i+2}\|x\|^{i+2}$, so that $h(x) = \sum_{i=0}^{r} a_i h_i(x)$, and by the definition of Bregman distance we have

$$ D_{h_i}(y,x) = \tfrac{1}{i+2}\|y\|^{i+2} - \tfrac{1}{i+2}\|x\|^{i+2} - \|x\|^i \langle x, y-x\rangle = \tfrac{1}{i+2}\left( \|y\|^{i+2} + (i+1)\|x\|^{i+2} - (i+2)\|x\|^i \langle x, y\rangle \right) . $$

Notice that

$$ \begin{aligned} \|y\|^{i+2} + (i+1)\|x\|^{i+2} - (i+2)\|x\|^i\langle x,y\rangle &= \left( \|y\|^{i+2} + \tfrac{i}{2}\|x\|^{i+2} - \tfrac{i+2}{2}\|x\|^i\|y\|^2 \right) + \tfrac{i+2}{2}\left( \|x\|^{i+2} + \|x\|^i\|y\|^2 - 2\|x\|^i\langle x,y\rangle \right) \\ &\ge 0 + \tfrac{i+2}{2}\,\|x\|^i\left( \|x\|^2 + \|y\|^2 - 2\langle x,y\rangle \right) = \tfrac{i+2}{2}\,\|x\|^i\,\|y-x\|^2 \ , \end{aligned} $$

where the inequality above is an application of the arithmetic-geometric mean inequality $a^{\lambda} b^{1-\lambda} \le \lambda a + (1-\lambda)b$ with $a = \|x\|^{i+2}$, $b = \|y\|^{i+2}$, and $\lambda = \frac{i}{i+2}$. Thus we have

$$ D_h(y,x) = \sum_{i=0}^{r} a_i D_{h_i}(y,x) \ \ge \ \frac{1}{2}\|y-x\|^2 \sum_{i=0}^{r} a_i\|x\|^i \ . \qquad (22) $$

Therefore

$$ \|g(x)\|^2 \ \le \ p_r(\|x\|) = \sum_{i=0}^{r} a_i\|x\|^i \ \le \ \frac{2\, D_h(y,x)}{\|y-x\|^2} \ , $$

which shows that f is 1-relatively continuous with respect to h.

Solving the linearization subproblem (7). Let us see how we can solve the linearization subproblem (7) for this class of optimization problems. The linearization subproblem (7) can be written as

$$ \text{LS:} \quad \min_{x \in \mathbb{R}^n} \ \langle c, x\rangle + \sum_{i=0}^{r} \frac{a_i}{i+2}\|x\|^{i+2} \ , \qquad (23) $$

and the first-order optimality condition is simply:

$$ c + \left( \sum_{i=0}^{r} a_i \|x\|^i \right) x = 0 \ , \qquad (24) $$
whereby $x = -\theta c$ for some scalar $\theta \ge 0$, and it remains to simply determine the value of the nonnegative scalar $\theta$. In the case when $c = 0$, $x = 0$ satisfies (24), so let us examine the case when $c \ne 0$, in which case from (24) $\theta$ must satisfy:

$$ \sum_{i=0}^{r} a_i \|c\|^i\, \theta^{i+1} - 1 = 0 \ , $$

which implies that $\theta$ is the unique positive root of a univariate polynomial that is monotone increasing in $\theta \ge 0$. For $r \in \{0, 1, 2, 3\}$ this root can be computed in closed form. Otherwise the root can be computed efficiently (up to machine precision) using any suitable root-finding method.

Remark 5.1. We can incorporate a simple set constraint $x \in Q$ in problem (23) provided that we can easily compute the Euclidean projection onto Q. In this case, the linearization subproblem (7) can be converted to a 1-dimensional convex optimization problem; see Appendix A.1 of [8] for details.

5.2 Stochastic Setting

In the stochastic setting, the stochastic subgradient g(x) is a conditional random variable for a given x. Suppose that $E\left[\|g(x)\|^2 \mid x\right] \le p_r(\|x\|)$ for all $x \in Q$, where $p_r(\alpha) = \sum_{i=0}^{r} a_i \alpha^i$ is an r-degree polynomial whose coefficients $\{a_i\}$ are nonnegative. Let

$$ h(x) := \sum_{i=0}^{r} \frac{a_i}{i+2}\|x\|^{i+2} \ , $$

and similar to the deterministic case we have:

Proposition 5.2. f is 1-stochastically continuous relative to $h(x) = \sum_{i=0}^{r} \frac{a_i}{i+2}\|x\|^{i+2}$.

Proof: For any $x, y \in Q$ with $x \ne y$ we have:

$$ E\left[\|g(x)\|^2 \mid x\right] \ \le \ p_r(\|x\|) = \sum_{i=0}^{r} a_i\|x\|^i \ \le \ \frac{2\, D_h(y,x)}{\|y-x\|^2} \ , \qquad (25) $$

where the last inequality follows from (22), and thus f is 1-stochastically continuous relative to h.

Solving the linearization subproblem (7). The linearization subproblem is identical in structure to that in the deterministic case, and so it can be solved as discussed at the end of Section 5.1.

5.3 Relative Continuity for instances of SVM and IEP

Here we examine in detail the two motivating examples stated in the Introduction, namely the Support Vector Machine (SVM) problem, and the Intersection of Ellipsoids Problem (IEP).
We first prove the following lemma, which presents upper bounds on the Bregman distances $D_h(\cdot,\cdot)$ for $h(x) = \frac{1}{3}\|x\|^3$ and $h(x) = \frac{1}{4}\|x\|^4$.
Lemma 5.1.

1. Let $h(x) := \frac{1}{3}\|x\|^3$. Then $D_h(y,x) \le \frac{1}{3}\|y-x\|^2\left(\|y\| + 2\|x\|\right)$.

2. Let $h(x) := \frac{1}{4}\|x\|^4$. Then $D_h(y,x) \le \frac{1}{4}\|y-x\|^2\left( (\|y\| + \|x\|)^2 + 2\|x\|^2 \right)$.

Proof:

1. We have

$$ \begin{aligned} D_h(y,x) &= \tfrac{1}{3}\left( \|y\|^3 + 2\|x\|^3 - 3\|x\|\langle x, y\rangle \right) \\ &\le \tfrac{1}{3}\left( \|y\|^3 + 2\|x\|^3 - 3\|x\|\langle x,y\rangle + 2\|y\|\left(\|x\|\|y\| - \langle x,y\rangle\right) + \|x\|\left(\|x\|\|y\| - \langle x,y\rangle\right) \right) \\ &= \tfrac{1}{3}\|y-x\|^2\left(\|y\| + 2\|x\|\right) \ , \end{aligned} $$

where the first equality follows from simplifying and combining terms, the inequality follows from applying the Cauchy-Schwarz inequality twice, and the final equality is from simplifying and combining terms.

2. We have

$$ \begin{aligned} D_h(y,x) &= \tfrac{1}{4}\left( \|y\|^4 + 3\|x\|^4 - 4\|x\|^2\langle x,y\rangle \right) \\ &\le \tfrac{1}{4}\left( \|y\|^4 + 3\|x\|^4 - 4\|x\|^2\langle x,y\rangle + 4\|x\|^2\|y\|^2 - 4\langle x,y\rangle^2 \right) \\ &= \tfrac{1}{4}\left( \|x\|^2 + \|y\|^2 - 2\langle x,y\rangle \right)\left( 3\|x\|^2 + \|y\|^2 + 2\langle x,y\rangle \right) \ \le \ \tfrac{1}{4}\|y-x\|^2\left( (\|y\| + \|x\|)^2 + 2\|x\|^2 \right) \ , \end{aligned} $$

where both inequalities follow from applying the Cauchy-Schwarz inequality, and the equalities follow from simplifying and combining terms.

Support Vector Machine (SVM). The Support Vector Machine is an important supervised learning model for binary classification in machine learning. The SVM optimization problem for binary classification is:

$$ \text{SVM:} \quad \min_{x} \ f(x) := \frac{1}{n}\sum_{i=1}^{n} \max\{0,\, 1 - y_i x^T w_i\} + \frac{\lambda}{2}\|x\|^2 \ , \qquad (26) $$

where $w_i$ is the input feature vector of sample i and $y_i \in \{1, -1\}$ is the label of sample i. Notice that f is not differentiable due to the presence of the hinge loss terms in the summation, and f is also not Lipschitz continuous due to the presence of the squared $\ell_2$-norm regularization term; thus we cannot directly utilize typical subgradient or gradient schemes and their associated computational guarantees in the analysis of (26). Researchers have developed various approaches to overcome this limitation. For example, [5] introduced a splitting subgradient-type method, where the basic idea is to split the loss function and the regularization terms. [16] introduced a quasi-Newton method that does not need to contend with the unbounded subgradients.
Another approach is to a priori constrain x to lie in an $\ell_2$-ball of radius R, for R sufficiently large that the ball contains the optimal solution, and to project onto this ball at each iteration; in this approach f is Lipschitz continuous on the amended feasible region, see [13]. Indeed, one can show using quadratic
optimization optimality conditions that it suffices to set $R = \min\left\{ \frac{1}{n\lambda}\sum_{i=1}^{n}\|w_i\|, \ \sqrt{2/\lambda} \right\}$ (see the Appendix), wherein the modulus of Lipschitz continuity on the amended feasible region is at most $M \le \frac{1}{n}\sum_{i=1}^{n}\|w_i\| + \min\left\{ \sqrt{2\lambda}, \ \frac{1}{n}\sum_{i=1}^{n}\|w_i\| \right\}$. Furthermore, in [7] the authors show that if the initial point lies within a suitably chosen large ball, then Stochastic Subgradient Descent with a small step-size ensures in expectation that all iterates lie in the large ball, which then ensures that the norms of all subgradients are bounded in expectation.

Let us see how we can directly use the constructs of relative continuity to tackle the SVM problem with a suitably designed version of Stochastic Mirror Descent, without any projection step onto a ball. We can rewrite the objective function of (26) as

$$ f(x) = \frac{1}{n}\sum_{j=1}^{n} f_j(x) \ , \quad \text{where } f_j(x) = \max\{0,\, 1 - y_j x^T w_j\} + \frac{\lambda}{2}\|x\|^2 \ . $$

We consider computing a stochastic estimate of the subgradient of f by using a single sample, namely $g(x) \in \partial f_j(x)$ where the index j is drawn uniformly at random from $\{1, \ldots, n\}$. Then $\|g(x)\| \le \lambda\|x\| + \|w_j\|$, whereby

$$ E\left[\|g(x)\|^2 \mid x\right] \ \le \ \lambda^2\|x\|^2 + 2\lambda\left( \frac{1}{n}\sum_{i=1}^{n}\|w_i\| \right)\|x\| + \frac{1}{n}\sum_{i=1}^{n}\|w_i\|^2 \ , $$

and notice that the right-hand side is a polynomial in $\|x\|$ of degree $r = 2$. If we choose the reference function h as

$$ h(x) := \frac{\lambda^2}{4}\|x\|^4 + \frac{2\lambda}{3n}\left( \sum_{i=1}^{n}\|w_i\| \right)\|x\|^3 + \frac{1}{2n}\left( \sum_{i=1}^{n}\|w_i\|^2 \right)\|x\|^2 \ , \qquad (27) $$

it follows from Proposition 5.2 that f is 1-stochastically continuous relative to h.

Proposition 5.3. (Computational Guarantees for Stochastic Mirror Descent for the SVM problem (26)) Consider applying the Stochastic Mirror Descent algorithm (Algorithm 2) to the Support Vector Machine problem (26) using the reference function (27). For an absolute optimality tolerance $\varepsilon > 0$, and using the constant step-sizes $t_i := \varepsilon$, let the algorithm be run for

$$ k := \left\lceil \frac{\|x^* - x^0\|^2 \left( 3\lambda^2\left( (\|x^*\| + \|x^0\|)^2 + 2\|x^0\|^2 \right) + \frac{8\lambda}{n}\left( \sum_{i=1}^{n}\|w_i\| \right)\left( \|x^*\| + 2\|x^0\| \right) + \frac{6}{n}\sum_{i=1}^{n}\|w_i\|^2 \right)}{6\,\varepsilon^2} \right\rceil $$

iterations, where $x^*$ is the optimal solution of (26). Then it holds that

$$ E\left[ f(\bar{x}^k) - f^* \right] \le \varepsilon \ , \quad \text{where } \bar{x}^k := \frac{1}{k+1}\sum_{i=0}^{k} x^i \ . $$

Proof: We showed above (using Proposition 5.2)
that f is 1-stochastically continuous relative to h defined in (27). Furthermore, applying Lemma 5.1 it follows that

$$ D_h(x^*, x^0) \ \le \ \frac{\lambda^2}{4}\|x^* - x^0\|^2\left( (\|x^*\| + \|x^0\|)^2 + 2\|x^0\|^2 \right) + \frac{2\lambda}{3n}\left( \sum_{i=1}^{n}\|w_i\| \right)\|x^* - x^0\|^2\left( \|x^*\| + 2\|x^0\| \right) + \frac{1}{2n}\left( \sum_{i=1}^{n}\|w_i\|^2 \right)\|x^* - x^0\|^2 \ . $$
The proof is finished by substituting these values into the computational guarantee of Corollary 4.2.

Intersection of Ellipsoids Problem (IEP). Consider the problem of computing a point $x \in \mathbb{R}^m$ in the intersection of n ellipsoids, namely:

$$ x \in Q := Q_1 \cap Q_2 \cap \cdots \cap Q_n \ , \qquad (28) $$

where $Q_i = \left\{ x \in \mathbb{R}^m : \tfrac{1}{2}x^T A_i x + b_i^T x + c_i \le 0 \right\}$ and $A_i \in \mathbb{R}^{m \times m}$ is a given symmetric positive semi-definite matrix, $i = 1, \ldots, n$. This problem can be cast as a second-order cone optimization problem, and hence can be tackled using interior-point methods. However, interior-point methods are typically only effective when the dimensions m and/or n are of moderate size. Another way to tackle the problem is to use a first-order method to solve the unconstrained problem

$$ \text{IEP:} \quad f^* = \min_{x} \ f(x) := \max_{1 \le i \le n}\left\{ \tfrac{1}{2}x^T A_i x + b_i^T x + c_i \right\} \ , \qquad (29) $$

and notice that $f(x) \le 0$ if and only if $x \in Q$, and $Q \ne \emptyset$ if and only if $f^* \le 0$. However, the objective function f in (29) is both non-differentiable and non-Lipschitz, and so it falls outside the scope of optimization problems for which traditional first-order methods are applicable.

Let us see how we can use the machinery of relative continuity to tackle this problem. Let $\sigma := \max_{1 \le i \le n} \|A_i\|$, where $\|A_i\|$ is the spectral radius of $A_i$; let $\rho := 2\max_{1 \le i \le n} \|A_i b_i\|$; and let $\gamma := \max_{1 \le i \le n} \|b_i\|$. Notice that for any x and $i = 1, \ldots, n$ we have $g_i(x) := A_i x + b_i = \nabla f_i(x)$, where $f_i(x)$ is the i-th term in the objective function of (29). Since $g(x) \in \partial f(x)$ if and only if g(x) is a convex combination of the active gradients $\nabla f_i(x)$ (see Danskin's Theorem in Appendix B of [3]), it follows for any $g(x) \in \partial f(x)$ that

$$ \|g(x)\|^2 \ \le \ \max_{1 \le i \le n} \|A_i x + b_i\|^2 \ = \ \max_{1 \le i \le n}\left( x^T A_i^2 x + 2\, b_i^T A_i x + \|b_i\|^2 \right) \ \le \ \sigma^2\|x\|^2 + \rho\|x\| + \gamma^2 \ . $$

Therefore we have $\|g(x)\|^2 \le p_2(\|x\|)$, where $p_2(\alpha) = \sigma^2\alpha^2 + \rho\alpha + \gamma^2$ is a quadratic function of $\alpha$, i.e., a polynomial in $\alpha$ of degree $r = 2$. It follows from Proposition 5.1 that f is 1-continuous relative to the reference function

$$ h(x) := \frac{\sigma^2}{4}\|x\|^4 + \frac{\rho}{3}\|x\|^3 + \frac{\gamma^2}{2}\|x\|^2 \ . \qquad (30) $$

Proposition 5.4.
(Computational Guarantees for Deterministic Mirror Descent for the IEP problem (29)) Consider applying the Deterministic Mirror Descent algorithm (Algorithm 1) to the Intersection of Ellipsoids Problem (29) using the reference function (30), where $\sigma := \max_{1 \le i \le n}\|A_i\|$ and $\|A_i\|$ is the spectral radius of $A_i$, $\rho := 2\max_{1 \le i \le n}\|A_i b_i\|$, and $\gamma := \max_{1 \le i \le n}\|b_i\|$. For an absolute optimality tolerance $\varepsilon > 0$, and using the constant step-sizes $t_i := \varepsilon$, let the algorithm be run for

$$ k := \left\lceil \frac{\|x^* - x^0\|^2\left( 3\sigma^2\left( (\|x^*\| + \|x^0\|)^2 + 2\|x^0\|^2 \right) + 4\rho\left( \|x^*\| + 2\|x^0\| \right) + 6\gamma^2 \right)}{6\,\varepsilon^2} \right\rceil $$

iterations, where $x^*$ is any optimal solution of (29). Then it holds that $f(\bar{x}^k) - f^* \le \varepsilon$, where $\bar{x}^k := \frac{1}{k+1}\sum_{i=0}^{k} x^i$. (This problem was suggested by Nesterov [12].)
Proof: We showed above (using Proposition 5.1) that f is 1-continuous relative to $h(x) = \frac{\sigma^2}{4}\|x\|^4 + \frac{\rho}{3}\|x\|^3 + \frac{\gamma^2}{2}\|x\|^2$. Furthermore, applying Lemma 5.1 it follows that

$$ D_h(x^*, x^0) \ \le \ \frac{\sigma^2}{4}\|x^* - x^0\|^2\left( (\|x^*\| + \|x^0\|)^2 + 2\|x^0\|^2 \right) + \frac{\rho}{3}\|x^* - x^0\|^2\left( \|x^*\| + 2\|x^0\| \right) + \frac{\gamma^2}{2}\|x^* - x^0\|^2 \ . $$

The proof is finished by substituting these values into the computational guarantee of Corollary 4.3.

Acknowledgement

The author would like to express his gratitude to Robert M. Freund for thoughtful discussions that helped motivate this work, for commenting on earlier drafts of this paper, and for advising on the presentation and positioning of this paper. The author also wishes to thank Yurii Nesterov for encouraging the author's work on this topic, and for pointing out the application of IEP.

Appendix: Finite Radius Bound for SVM

Here we derive an upper bound on the norm of the optimal solution of the SVM problem (26).

Proposition 5.5. The optimal solution of the SVM problem (26) lies in the ball $B(0, R)$ for $R = \min\left\{ \frac{1}{n\lambda}\sum_{i=1}^{n}\|w_i\|, \ \sqrt{2/\lambda} \right\}$.

Proof: For convenience define $A_i := y_i w_i$ for $i = 1, \ldots, n$, and let A be the matrix whose i-th row is $A_i^T$. Then we can re-write the SVM problem as the following constrained optimization problem:

$$ \min_{s,\, x} \ \frac{1}{n}e^T s + \frac{\lambda}{2}\|x\|^2 \quad \text{s.t.} \quad s + Ax \ge e, \ \ s \ge 0 \ . $$

Let $\pi$ and $\beta$ be the multipliers on the two inequality constraints above. Then the KKT conditions imply, among other things, that the optimal solution $x^*$ must satisfy:

$$ \pi + \beta = \frac{1}{n}e \ , \qquad \lambda x^* = A^T \pi \ , $$

where $\pi \ge 0$ and $\beta \ge 0$. Define $\bar{\pi} = n\pi$. Then $0 \le \bar{\pi} \le e$ and

$$ \lambda\|x^*\| = \|A^T \pi\| = \frac{1}{n}\|A^T \bar{\pi}\| \ \le \ \frac{1}{n}\sum_{i=1}^{n}\|A_i\| = \frac{1}{n}\sum_{i=1}^{n}\|w_i\| \ , $$

which proves the first term in the definition of R. Also, we have $\frac{\lambda}{2}\|x^*\|^2 \le f(x^*) \le f(0) = 1$, thus $\|x^*\| \le \sqrt{2/\lambda}$. Therefore $\|x^*\| \le \min\left\{ \frac{1}{n\lambda}\sum_{i=1}^{n}\|w_i\|, \ \sqrt{2/\lambda} \right\}$, which finishes the proof.
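To make the IEP development above concrete, here is a minimal numerical sketch (assuming NumPy; the ellipsoid data, step-size, and iteration count are made up for illustration) of Deterministic Mirror Descent with the reference function of the form (30), where each update solves the linearization subproblem via the positive-root equation described in Section 5.1, here by simple bisection.

```python
import numpy as np

def mirror_step(c, a):
    """Minimize <c,x> + h(x) for h(x) = sum_i a_i/(i+2) ||x||^(i+2).
    The minimizer is x = -theta*c, where theta >= 0 is the unique root of
    sum_i a_i ||c||^i theta^(i+1) = 1 (found here by bisection)."""
    nc = np.linalg.norm(c)
    if nc == 0.0:
        return np.zeros_like(c)
    phi = lambda th: sum(ai * nc**i * th**(i + 1) for i, ai in enumerate(a))
    lo, hi = 0.0, 1.0
    while phi(hi) < 1.0:              # bracket the root (phi is increasing)
        hi *= 2.0
    for _ in range(60):               # bisection to high precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < 1.0 else (lo, mid)
    return -0.5 * (lo + hi) * c

# A made-up IEP instance: f_i(x) = 0.5 x'A_i x + b_i'x + c_i, all containing 0
rng = np.random.default_rng(1)
n, m = 5, 4
A = [np.eye(m) + 0.1 * (M @ M.T) for M in rng.normal(size=(n, m, m))]
b = [0.5 * v for v in rng.normal(size=(n, m))]
cs = [-1.0] * n                       # f_i(0) = -1 < 0, so f* <= -1

def f_and_subgrad(x):
    vals = [0.5 * x @ Ai @ x + bi @ x + ci for Ai, bi, ci in zip(A, b, cs)]
    j = int(np.argmax(vals))
    return vals[j], A[j] @ x + b[j]   # subgradient from an active piece

sigma = max(np.linalg.norm(Ai, 2) for Ai in A)                 # spectral norms
rho = 2.0 * max(np.linalg.norm(Ai @ bi) for Ai, bi in zip(A, b))
gamma = max(np.linalg.norm(bi) for bi in b)
coeffs = [gamma**2, rho, sigma**2]    # coefficients of p_2, giving h as in (30)

def grad_h(x):
    r = np.linalg.norm(x)
    return (sigma**2 * r**2 + rho * r + gamma**2) * x

x = rng.normal(size=m)                # arbitrary starting point x^0
t = 0.1                               # made-up constant step-size
best = np.inf
for _ in range(5000):
    val, g = f_and_subgrad(x)
    best = min(best, val)
    # x^{k+1} = argmin_x  t<g,x> + D_h(x, x^k)  =  argmin_x <t g - grad_h(x^k), x> + h(x)
    x = mirror_step(t * g - grad_h(x), coeffs)
print(best)   # best objective value found; a negative value means x entered Q
```

The step-size here is an arbitrary constant rather than the $\varepsilon$-dependent choice of Proposition 5.4; the point of the sketch is only the structure of the update, in which no Lipschitz constant for f is ever needed.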
References

[1] H.H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications, Mathematics of Operations Research 42 (2016), no. 2.

[2] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters 31 (2003), no. 3.

[3] D. Bertsekas, Nonlinear programming, Athena Scientific, Belmont, MA, 1999.

[4] S. Bubeck, Convex optimization: algorithms and complexity, Foundations and Trends in Machine Learning 8 (2015), no. 3-4.

[5] J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, Journal of Machine Learning Research 10 (2009), no. Dec.

[6] F. Hanzely and P. Richtarik, Randomized methods for minimizing relatively smooth functions, Tech. report, 2017.

[7] S. Lacoste-Julien, M. Schmidt, and F. Bach, A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, arXiv preprint arXiv:1212.2002 (2012).

[8] H. Lu, R. M. Freund, and Y. Nesterov, Relatively smooth convex optimization by first-order methods, and applications, SIAM Journal on Optimization 28 (2018), no. 1.

[9] A. Nedić and S. Lee, On stochastic subgradient mirror-descent algorithm with weighted averaging, SIAM Journal on Optimization 24 (2014), no. 1.

[10] A. S. Nemirovsky and D. B. Yudin, Problem complexity and method efficiency in optimization, Wiley, New York, 1983.

[11] Y. Nesterov, Introductory lectures on convex optimization: a basic course, Kluwer Academic Publishers, Boston, 2003.

[12] Y. Nesterov, private communication, 2016.

[13] S. Shalev-Shwartz, Y. Singer, and N. Srebro, Pegasos: primal estimated sub-gradient solver for SVM, Proceedings of the 24th International Conference on Machine Learning, ACM, 2007.

[14] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, Tech. report, May 2008.

[15] Q.
Van Nguyen, Forward-backward splitting with Bregman distances, Vietnam Journal of Mathematics 45 (2017), no. 3.

[16] J. Yu, S.V.N. Vishwanathan, S. Günter, and N.N. Schraudolph, A quasi-Newton approach to nonsmooth convex optimization problems in machine learning, Journal of Machine Learning Research 11 (2010), no. Mar.
[17] Y. Zhou, Y. Liang, and L. Shen, A unified approach to proximal algorithms using Bregman distance, Tech. report, 2016.
More informationSolving Dual Problems
Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem
More informationStatistical Machine Learning II Spring 2017, Learning Theory, Lecture 4
Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.
More informationA DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson
204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,
More informationThe Relation Between Pseudonormality and Quasiregularity in Constrained Optimization 1
October 2003 The Relation Between Pseudonormality and Quasiregularity in Constrained Optimization 1 by Asuman E. Ozdaglar and Dimitri P. Bertsekas 2 Abstract We consider optimization problems with equality,
More informationLecture 2: Subgradient Methods
Lecture 2: Subgradient Methods John C. Duchi Stanford University Park City Mathematics Institute 206 Abstract In this lecture, we discuss first order methods for the minimization of convex functions. We
More informationQuasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization
Quasi-Newton Methods Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization 10-725 Last time: primal-dual interior-point methods Given the problem min x f(x) subject to h(x)
More informationarxiv: v3 [math.oc] 17 Dec 2017 Received: date / Accepted: date
Noname manuscript No. (will be inserted by the editor) A Simple Convergence Analysis of Bregman Proximal Gradient Algorithm Yi Zhou Yingbin Liang Lixin Shen arxiv:1503.05601v3 [math.oc] 17 Dec 2017 Received:
More informationCoordinate Update Algorithm Short Course Subgradients and Subgradient Methods
Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n
More informationAccelerated Proximal Gradient Methods for Convex Optimization
Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationVariational Inequalities. Anna Nagurney Isenberg School of Management University of Massachusetts Amherst, MA 01003
Variational Inequalities Anna Nagurney Isenberg School of Management University of Massachusetts Amherst, MA 01003 c 2002 Background Equilibrium is a central concept in numerous disciplines including economics,
More informationLecture 3. Optimization Problems and Iterative Algorithms
Lecture 3 Optimization Problems and Iterative Algorithms January 13, 2016 This material was jointly developed with Angelia Nedić at UIUC for IE 598ns Outline Special Functions: Linear, Quadratic, Convex
More informationBISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption
BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption Daniel Reem (joint work with Simeon Reich and Alvaro De Pierro) Department of Mathematics, The Technion,
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationA projection algorithm for strictly monotone linear complementarity problems.
A projection algorithm for strictly monotone linear complementarity problems. Erik Zawadzki Department of Computer Science epz@cs.cmu.edu Geoffrey J. Gordon Machine Learning Department ggordon@cs.cmu.edu
More informationCharacterizing Robust Solution Sets of Convex Programs under Data Uncertainty
Characterizing Robust Solution Sets of Convex Programs under Data Uncertainty V. Jeyakumar, G. M. Lee and G. Li Communicated by Sándor Zoltán Németh Abstract This paper deals with convex optimization problems
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationThe Steepest Descent Algorithm for Unconstrained Optimization
The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem
More informationComputational Learning Theory - Hilary Term : Learning Real-valued Functions
Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling
More informationAn Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods
An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This
More informationOptimisation in Higher Dimensions
CHAPTER 6 Optimisation in Higher Dimensions Beyond optimisation in 1D, we will study two directions. First, the equivalent in nth dimension, x R n such that f(x ) f(x) for all x R n. Second, constrained
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationA Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints
A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints Hao Yu and Michael J. Neely Department of Electrical Engineering
More informationQuiz Discussion. IE417: Nonlinear Programming: Lecture 12. Motivation. Why do we care? Jeff Linderoth. 16th March 2006
Quiz Discussion IE417: Nonlinear Programming: Lecture 12 Jeff Linderoth Department of Industrial and Systems Engineering Lehigh University 16th March 2006 Motivation Why do we care? We are interested in
More informationEfficient Methods for Stochastic Composite Optimization
Efficient Methods for Stochastic Composite Optimization Guanghui Lan School of Industrial and Systems Engineering Georgia Institute of Technology, Atlanta, GA 3033-005 Email: glan@isye.gatech.edu June
More informationCHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008.
1 ECONOMICS 594: LECTURE NOTES CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS W. Erwin Diewert January 31, 2008. 1. Introduction Many economic problems have the following structure: (i) a linear function
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationGradient methods for minimizing composite functions
Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum
More information