A Greedy Framework for First-Order Optimization
Jacob Steinhardt (Department of Computer Science, Stanford University, Stanford, CA) and Jonathan Huggins (Department of EECS, Massachusetts Institute of Technology, Cambridge, MA 02139). Both authors contributed equally to this work.

Introduction. Recent work has shown many connections between conditional gradient and other first-order optimization methods, such as herding [3] and subgradient descent [2]. By considering a type of proximal conditional gradient method, which we call boosted mirror descent (BMD), we are able to unify all of these algorithms into a single framework, which can be interpreted as taking successive arg-mins of a sequence of surrogate functions. Using a standard online learning analysis based on Bregman divergences, we are able to demonstrate an $O(1/T)$ convergence rate for all algorithms in this class.

Setup. More concretely, suppose that we are given a function $L : U \times \Theta \to \mathbb{R}$ defined by
$$L(u, \theta) = h(u) + \langle u, \theta \rangle - R(\theta) \tag{1}$$
and wish to find the saddle point
$$L^* \stackrel{\mathrm{def}}{=} \min_{u} \max_{\theta} L(u, \theta). \tag{2}$$
We should think of $u$ as the primal variables and $\theta$ as the dual variables; we will assume throughout that $h$ and $R$ are both convex. We will also abuse notation and define $L(u) \stackrel{\mathrm{def}}{=} \max_{\theta} L(u, \theta)$; we can equivalently write $L(u)$ as
$$L(u) = h(u) + R^*(u), \tag{3}$$
where $R^*$ is the Fenchel conjugate of $R$. Note that $L(u)$ is a convex function. Moreover, since $R \mapsto R^*$ is a one-to-one mapping, for any convex function $L$ and any decomposition of $L$ into convex functions $h$ and $R^*$, we get a corresponding two-argument function $L(u, \theta)$.

Primal algorithm. Given the function $L(u, \theta)$, we define the following optimization procedure, which will generate a sequence of points $(u_1, \theta_1), (u_2, \theta_2), \ldots$ converging to a saddle point of $L$. First, take a sequence of weights $\alpha_1, \alpha_2, \ldots$, and for notational convenience define
$$\hat{u}_t = \frac{\sum_{s=1}^{t} \alpha_s u_s}{\sum_{s=1}^{t} \alpha_s} \qquad \text{and} \qquad \hat{\theta}_t = \frac{\sum_{s=1}^{t} \alpha_s \theta_s}{\sum_{s=1}^{t} \alpha_s}.$$
Then the primal boosted mirror descent (PBMD) algorithm is the following iterative procedure:
1. $u_1 \in \arg\min_u h(u)$
2. $\theta_t \in \arg\max_{\theta \in \Theta} \langle \theta, u_t \rangle - R(\theta) = \nabla R^*(u_t)$
3. $u_{t+1} \in \arg\min_u h(u) + \langle \hat{\theta}_t, u \rangle = \nabla h^*(-\hat{\theta}_t)$

As long as $h$ is strongly convex, for the proper choice of $\alpha_t$ we obtain the bound (see Corollary 2):
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le L^* + O(1/T). \tag{4}$$

As an example, suppose that we are given a $\gamma$-strongly convex function $f$: that is, $f(x) = \frac{\gamma}{2}\|x\|_2^2 + f_0(x)$, where $f_0$ is convex. Then we let $h(x) = \frac{\gamma}{2}\|x\|_2^2$, $R^*(x) = f_0(x)$, and obtain the updates:
1. $u_1 = 0$
2. $\theta_t \in \partial f_0(u_t)$
3. $u_{t+1} = -\frac{1}{\gamma}\hat{\theta}_t = -\frac{\sum_{s=1}^{t} \alpha_s \partial f_0(u_s)}{\gamma \sum_{s=1}^{t} \alpha_s}$

We therefore obtain a variant on subgradient descent where $u_{t+1}$ is a weighted average of the first $t$ subgradients (times a step size $1/\gamma$). Note that these are the subgradients of $f_0$, which are related to the subgradients of $f$ by $\partial f_0(x) = \partial f(x) - \gamma x$. A short sketch of this variant is given below.
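To make the updates concrete, here is a minimal sketch of the subgradient-descent variant above (our illustration, not the authors' code); the subgradient oracle `grad_f0` and the schedule $\alpha_t = t$ are assumed inputs, and any element of the subdifferential may be returned.

```python
import numpy as np

def pbmd_subgradient(grad_f0, gamma, dim, T):
    """Minimal sketch of the PBMD subgradient variant for
    f(x) = (gamma/2)*||x||^2 + f0(x), given a subgradient oracle
    grad_f0 for f0 and the linear weights alpha_t = t."""
    u = np.zeros(dim)             # step 1: u_1 = argmin_u (gamma/2)||u||^2 = 0
    grad_sum = np.zeros(dim)      # running sum of alpha_s * grad_f0(u_s)
    iterate_sum = np.zeros(dim)   # running sum of alpha_s * u_s
    weight_sum = 0.0              # A_t = sum_{s <= t} alpha_s
    for t in range(1, T + 1):
        alpha = float(t)          # alpha_t = t gives the O(1/T) rate of Corollary 2
        g = grad_f0(u)            # step 2: theta_t, a subgradient of f0 at u_t
        grad_sum += alpha * g
        iterate_sum += alpha * u
        weight_sum += alpha
        u = -grad_sum / (gamma * weight_sum)  # step 3: u_{t+1} = -theta_hat_t / gamma
    return iterate_sum / weight_sum           # the averaged iterate u_hat_T
```

For instance, `grad_f0 = lambda x: np.sign(x - b)` (for a fixed vector `b`) runs the method on $f(x) = \frac{\gamma}{2}\|x\|_2^2 + \|x - b\|_1$.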
Dual algorithm. We can also consider the dual form of our mirror descent algorithm (dual boosted mirror descent, or DBMD):
1. $\theta_1 \in \arg\min_{\theta} R(\theta)$
2. $u_t \in \arg\min_u h(u) + \langle \theta_t, u \rangle = \nabla h^*(-\theta_t)$
3. $\theta_{t+1} \in \arg\max_{\theta \in \Theta} \langle \theta, \hat{u}_t \rangle - R(\theta) = \nabla R^*(\hat{u}_t)$

Convergence now hinges upon strong convexity of $R$ rather than $h$, and we obtain the same $O(1/T)$ convergence rate (see Corollary 4):
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le L^* + O(1/T). \tag{5}$$

An important special case is $h(u) = 0$ if $u \in K$ and $h(u) = \infty$ if $u \notin K$, where $K$ is some compact set. Also let $R^* = f$, where $f$ is an arbitrary strongly convex function. Then we are minimizing $f$ over the compact set $K$, and we obtain the following updates, which are equivalent to (standard) conditional gradient or Frank-Wolfe (a sketch follows below):
1. $\theta_1 = \nabla f(0)$
2. $u_t \in \arg\min_{u \in K} \langle \theta_t, u \rangle$
3. $\theta_{t+1} = \nabla f(\hat{u}_t)$

Our notation is slightly different from previous presentations in that we use linear weights ($\alpha_t$) instead of geometric weights (often denoted $\rho_t$, as in [2]). However, under the mapping $\alpha_t = \rho_t / \prod_{s=1}^{t} (1 - \rho_s)$, we obtain an algorithm equivalent to the usual formulation.
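Here is a minimal sketch of these Frank-Wolfe updates (again ours, not the authors'); the gradient oracle `grad_f` and the linear minimization oracle `lmo` over $K$ are assumed inputs.

```python
import numpy as np

def frank_wolfe(grad_f, lmo, dim, T):
    """Minimal sketch of DBMD specialized to Frank-Wolfe: minimize a
    strongly convex f over a compact set K, where lmo(theta) returns
    argmin_{u in K} <theta, u>. Uses the linear weights alpha_t = t."""
    theta = grad_f(np.zeros(dim))   # step 1: theta_1 = grad f(0)
    u_hat = np.zeros(dim)           # weighted average of iterates, u_hat_t
    weight_sum = 0.0                # A_t = sum of alpha_s
    for t in range(1, T + 1):
        u = lmo(theta)              # step 2: linear minimization over K
        weight_sum += t
        u_hat += (t / weight_sum) * (u - u_hat)  # fold u_t into the average
        theta = grad_f(u_hat)       # step 3: theta_{t+1} = grad f(u_hat_t)
    return u_hat
```

Over the probability simplex, for example, `lmo` can simply return the coordinate vector $e_i$ with $i = \arg\min_i \theta_i$, so that $\hat{u}_t$ is a convex combination of at most $t$ vertices, matching the sparsity property discussed next.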
Optimality. The solutions generated by conditional gradient have an attractive sparsity property: the solution after the $t$-th iteration is a convex combination of $t$ extreme points of the optimized space. In [4] it is shown that conditional gradient is asymptotically optimal amongst algorithms with sparse solutions. This suggests that the $O(1/T)$ convergence rate obtained in this paper cannot be improved without further assumptions. For further discussion of the properties of conditional gradient, see Jaggi [4].

Primal vs. dual algorithms. It is worth taking a moment to contrast the primal and dual algorithms. Note that both algorithms have a primal convergence guarantee (i.e., both guarantee that $\hat{u}_T$ is close to optimal). The sense in which the latter algorithm is a dual algorithm is that it hinges on strong convexity of the dual, as opposed to strong convexity of the primal. If we care about dual convergence instead of primal convergence, the only change we need to make is to take $\hat{\theta}_T$ at the end of the algorithm instead of $\hat{u}_T$. (This can be seen by defining the loss function $L'(\theta, u) \stackrel{\mathrm{def}}{=} -L(u, \theta)$, which inverts the roles of $u$ and $\theta$ but yields the same updates as above.)

Discussion. Our framework is intriguing in that it involves a purely greedy minimization of surrogate loss functions (alternating between the primal and dual), and yet is powerful enough to capture both (a variant of) subgradient descent and conditional gradient descent, as well as a host of other first-order methods. An example of a more complex first-order method captured by our framework is the low-rank SDP solver introduced by Arora, Hazan, and Kale [1] (more precisely, a variant of their algorithm, which we present for ease of exposition). Briefly, the AHK algorithm seeks to minimize $\sum_{j=1}^{m} (\mathrm{Tr}(A_j^T X) - b_j)^2$ subject to the constraints $X \succeq 0$ and $\mathrm{Tr}(X) \le \rho$. We can then define
$$h(X) = \begin{cases} 0 & : X \succeq 0 \text{ and } \mathrm{Tr}(X) \le \rho \\ \infty & : \text{else} \end{cases} \tag{6}$$
and
$$R^*(X) = \sum_{j=1}^{m} (\mathrm{Tr}(A_j^T X) - b_j)^2. \tag{7}$$
Note that this decomposition is actually a special case of the conditional gradient decomposition above, and so we obtain the updates
$$X_{t+1} \in \mathop{\mathrm{argmin}}_{X \succeq 0,\ \mathrm{Tr}(X) \le \rho} \ \sum_{j=1}^{m} \left[\mathrm{Tr}(A_j^T \hat{X}_t) - b_j\right] \mathrm{Tr}(A_j^T X), \tag{8}$$
whose solution turns out to be $\rho v v^T$, where $v$ is the top singular vector of the matrix $-\sum_{j=1}^{m} [\mathrm{Tr}(A_j^T \hat{X}_t) - b_j] A_j$; a sketch of this step is given below. This example serves both to illustrate the flexibility of our framework and to highlight the interesting fact that the Arora-Hazan-Kale SDP algorithm is a special case of conditional gradient (to get the original formulation in [1], we need to replace the function $x^2$ with $x_+ \log x_+$, where $x_+ = \max(x, 0)$). It is worth noting that the analysis in [1] appears to be tighter in their special case than the analysis we give here. We plan to investigate this phenomenon in future work.
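For illustration, update (8) reduces to an extreme-eigenvector computation. The sketch below is ours (using a dense eigendecomposition where a practical solver would use a Lanczos-type method), not the authors' implementation.

```python
import numpy as np

def ahk_step(A, b, X_hat, rho):
    """One conditional-gradient step (8): minimize the linear function
    sum_j [Tr(A_j^T X_hat) - b_j] * Tr(A_j^T X) over {X >= 0, Tr(X) <= rho}.

    A has shape (m, n, n) (symmetric matrices) and b has shape (m,). A
    linear function Tr(C^T X) over this set is minimized at rho * v v^T,
    where v is the eigenvector for the smallest eigenvalue of C (or at
    X = 0 if that eigenvalue is nonnegative).
    """
    residuals = np.einsum('jik,ik->j', A, X_hat) - b  # Tr(A_j^T X_hat) - b_j
    C = np.einsum('j,jik->ik', residuals, A)          # coefficients of the linear objective
    eigvals, eigvecs = np.linalg.eigh(C)              # eigenvalues in ascending order
    if eigvals[0] >= 0:
        return np.zeros_like(X_hat)                   # the zero matrix is optimal
    v = eigvecs[:, 0]                                 # eigenvector of the most negative eigenvalue
    return rho * np.outer(v, v)                       # X_{t+1} = rho * v v^T
```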
Related work. A generalized Frank-Wolfe algorithm has also been proposed in [5], which considers first-order convex surrogate functions more generally. Algorithm 3 of [5] turns out to be equivalent to a line-search formulation of the algorithm we propose in this paper. However, the perspective is quite different. While both algorithms share the intuition of trying to construct a proximal Frank-Wolfe algorithm, the general framework of [5] studies convex upper bounds, rather than the convex lower bounds of this paper (although for the proximal gradient surrogates that they suggest using, their Frank-Wolfe algorithm does implicitly use lower bounds). Another difference is that their analysis focuses on the Lipschitz properties of the objective function, whereas we use the more general notion of strong convexity with respect to an arbitrary norm. We believe that this geometric perspective is useful for tailoring the proximal function to specific applications, as demonstrated in our discussion of q-herding below. Finally, the saddle point perspective on Frank-Wolfe is not present in [5], although they do discuss the general idea of saddle-point methods in relation to convex surrogates.

q-herding. In addition to unifying several existing methods, our framework allows us to extend herding to an algorithm we call q-herding. Herding is an algorithm for constructing pseudosamples that match a specified collection of moments from a distribution; it was originally introduced by Welling [6] and was shown to be a special case of conditional gradient by Bach et al. [3]. Let $\mathcal{M}_X$ be the probability simplex over $X$. Herding can be cast as trying to minimize $\|\mathbb{E}_\mu[\phi(x)] - \bar{\phi}\|_2^2$ over $\mu \in \mathcal{M}_X$, for a given $\phi : X \to \mathbb{R}^d$ and $\bar{\phi} \in \mathbb{R}^d$. As shown in [3], the herding updates are equivalent to DBMD with $h(\mu) \equiv 0$ and $R(\theta) = \theta^T \bar{\phi} + \frac{1}{2}\|\theta\|_2^2$.

The implicit assumption in the herding algorithm is that $\|\phi(x)\|_2$ is bounded. We are able to construct a more general algorithm that only requires $\|\phi(x)\|_p$ to be bounded for some $p$. This q-herding algorithm works by taking $R(\theta) = \theta^T \bar{\phi} + \frac{1}{q}\|\theta\|_q^q$, where $\frac{1}{p} + \frac{1}{q} = 1$. In this case our convergence results imply that $\|\mathbb{E}_\mu[\phi(x)] - \bar{\phi}\|_p^p$ decays at a rate of $O(1/T)$ (proof to appear in an extended version of this paper).

Convergence results. We end by stating our formal convergence results. Proofs are given in the Appendix. For the primal algorithm (PBMD) we have:

Theorem 1. Suppose that $h$ is strongly convex with respect to a norm $\|\cdot\|$ and let $r = \sup_{\theta, \theta' \in \Theta} \|\theta - \theta'\|_*$. Then, for any $u$,
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le \sup_{\theta \in \Theta} L(u, \theta) + \frac{2r^2}{A_T} \sum_{t=1}^{T} \frac{\alpha_{t+1}^2 A_t}{A_{t+1}^2}. \tag{9}$$

Corollary 2. Under the hypotheses of Theorem 1, for $\alpha_t = 1$ we have
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le \sup_{\theta \in \Theta} L(u, \theta) + \frac{2r^2 (\log(T) + 1)}{T}, \tag{10}$$
and for $\alpha_t = t$ we have
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le \sup_{\theta \in \Theta} L(u, \theta) + \frac{8r^2}{T}. \tag{11}$$

Similarly, for the dual algorithm (DBMD) we have:

Theorem 3. Suppose that $R$ is strongly convex with respect to a norm $\|\cdot\|$ and let $r = \sup_{u, u'} \|u - u'\|_*$. Then, for any $u$,
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le \sup_{\theta \in \Theta} L(u, \theta) + \frac{2r^2}{A_T} \sum_{t=1}^{T} \frac{\alpha_{t+1}^2 A_t}{A_{t+1}^2}. \tag{12}$$

Corollary 4. Under the hypotheses of Theorem 3, for $\alpha_t = 1$ we have
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le \sup_{\theta \in \Theta} L(u, \theta) + \frac{2r^2 (\log(T) + 1)}{T}, \tag{13}$$
and for $\alpha_t = t$ we have
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le \sup_{\theta \in \Theta} L(u, \theta) + \frac{8r^2}{T}. \tag{14}$$

Thus, a step size of $\alpha_t = t$ yields the claimed $O(1/T)$ convergence rate.

Acknowledgements. Thanks to Cameron Freer, Simon Lacoste-Julien, Martin Jaggi, Percy Liang, Arun Chaganty, and Sida Wang for helpful discussions. Thanks also to the anonymous reviewer for suggesting improvements to the paper. JS was supported by the Hertz Foundation. JHH was supported by the U.S. Government under FA9550-11-C-0028 and awarded by the Department of Defense, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

References
[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In FOCS, 2005.
[2] Francis Bach. Duality between subgradient and conditional gradient methods. arXiv preprint, November 2012.
[3] Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In ICML, 2012.
[4] Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML, 2013.
[5] Julien Mairal. Optimization with first-order surrogate functions. In ICML, 2013.
[6] Max Welling. Herding dynamical weights to learn. In ICML, 2009.
Appendix: Convergence Proofs

We now prove the convergence results given above. Throughout this section, assume that $\alpha_1, \ldots, \alpha_T$ is a sequence of real numbers and that $A_t = \sum_{s=1}^{t} \alpha_s$. We further require that $A_t > 0$ for all $t$. Also recall that the Bregman divergence is defined by
$$D_f(x' \,\|\, x) \stackrel{\mathrm{def}}{=} f(x') - f(x) - \langle x' - x, \nabla f(x) \rangle.$$
Our proofs hinge on the following key lemma:

Lemma 5. Let $z_1, \ldots, z_T$ be vectors and let $f(x)$ be a strictly convex function. Define $\hat{z}_t$ to be $\frac{1}{A_t} \sum_{s=1}^{t} \alpha_s z_s$. Let $x_1, \ldots, x_{T+1}$ be chosen via $x_{t+1} = \arg\min_x f(x) + \langle \hat{z}_t, x \rangle$ (with $\hat{z}_0 \stackrel{\mathrm{def}}{=} 0$, so that $x_1 = \arg\min_x f(x)$). Then for any $x^*$ we have
$$\frac{1}{A_T} \sum_{t=1}^{T} \alpha_t \big( f(x_t) + \langle z_t, x_t \rangle \big) \le f(x^*) + \langle \hat{z}_T, x^* \rangle + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_f(x_t \,\|\, x_{t+1}).$$

Proof. First note that, if $x_0 = \arg\min_x f(x) + \langle z, x \rangle$, then $\nabla f(x_0) = -z$. Now note that
$$\alpha_t z_t = A_t \hat{z}_t - A_{t-1} \hat{z}_{t-1} \tag{15}$$
$$= -A_t \nabla f(x_{t+1}) + A_{t-1} \nabla f(x_t), \tag{16}$$
so we have
$$\sum_{t=1}^{T} \alpha_t \big( f(x_t) + \langle z_t, x_t \rangle \big) \tag{17}$$
$$= \sum_{t=1}^{T} \big( \alpha_t f(x_t) + \langle A_{t-1} \nabla f(x_t) - A_t \nabla f(x_{t+1}), x_t \rangle \big) \tag{18}$$
$$= \sum_{t=1}^{T} \big( \alpha_t f(x_t) - A_t \langle \nabla f(x_{t+1}), x_t - x_{t+1} \rangle \big) - A_T \langle \nabla f(x_{T+1}), x_{T+1} \rangle \tag{19, 20}$$
$$= \sum_{t=1}^{T} A_t \big( f(x_t) - f(x_{t+1}) - \langle \nabla f(x_{t+1}), x_t - x_{t+1} \rangle \big) + A_T \big( f(x_{T+1}) - \langle \nabla f(x_{T+1}), x_{T+1} \rangle \big) \tag{21}$$
$$= \sum_{t=1}^{T} A_t D_f(x_t \,\|\, x_{t+1}) + A_T \big( f(x_{T+1}) + \langle \hat{z}_T, x_{T+1} \rangle \big) \tag{22, 23}$$
$$\le \sum_{t=1}^{T} A_t D_f(x_t \,\|\, x_{t+1}) + A_T \big( f(x^*) + \langle \hat{z}_T, x^* \rangle \big), \tag{24}$$
where the last step uses the fact that $x_{T+1}$ minimizes $f(x) + \langle \hat{z}_T, x \rangle$. Dividing both sides by $A_T$ completes the proof.

We also note that $D_f(x_t \,\|\, x_{t+1}) = D_{f^*}(-\hat{z}_t \,\|\, -\hat{z}_{t-1})$, where $f^*(z) = \sup_x \langle z, x \rangle - f(x)$. This form of the bound will often be more useful to us. Another useful property of Bregman divergences is the following lemma, which relates strong convexity of $f$ to strong smoothness of $f^*$:

Lemma 6. Suppose that $D_f(x' \,\|\, x) \ge \frac{1}{2}\|x' - x\|^2$ for some norm $\|\cdot\|$ and for all $x, x'$. (In this case we say that $f$ is strongly convex with respect to $\|\cdot\|$.) Then $D_{f^*}(y' \,\|\, y) \le \frac{1}{2}\|y' - y\|_*^2$ for all $y, y'$, where $\|\cdot\|_*$ denotes the dual norm.
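Two standard instances (our examples, included for orientation) may help in reading Lemma 6. For $f(x) = \frac{1}{2}\|x\|_2^2$ we have $D_f(x' \,\|\, x) = \frac{1}{2}\|x' - x\|_2^2$ and $f^* = f$, so both inequalities hold with equality. For the negative entropy $f(x) = \sum_i x_i \log x_i$ on the probability simplex, $D_f(x' \,\|\, x)$ is the KL divergence, which by Pinsker's inequality satisfies $D_f(x' \,\|\, x) \ge \frac{1}{2}\|x' - x\|_1^2$; correspondingly, $f^*(y) = \log \sum_i e^{y_i}$ satisfies $D_{f^*}(y' \,\|\, y) \le \frac{1}{2}\|y' - y\|_\infty^2$.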
We require a final supporting proposition before proving Theorem 1.

Proposition 7 (Convergence of PBMD). Consider the updates $\theta_t \in \arg\max_\theta \langle \theta, u_t \rangle - R(\theta)$ and $u_{t+1} \in \arg\min_u h(u) + \langle \hat{\theta}_t, u \rangle$. Then we have
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le \sup_{\theta \in \Theta} L(u, \theta) + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_h(u_t \,\|\, u_{t+1}). \tag{25}$$

Proof. Note that $L(u_t, \theta_t) = \max_\theta L(u_t, \theta)$ by construction. Also note that, if we invoke Lemma 5 with $f = h$ and $z_t = \theta_t$, then we get the inequality
$$\frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u_t, \theta_t) \le \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u, \theta_t) + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_h(u_t \,\|\, u_{t+1}). \tag{26}$$
Combining these together, we get, for any $\theta \in \Theta$, the string of inequalities
$$L(\hat{u}_T, \theta) = L\Big( \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t u_t, \; \theta \Big) \le \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u_t, \theta) \le \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u_t, \theta_t)$$
$$\le \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u, \theta_t) + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_h(u_t \,\|\, u_{t+1}) \le \sup_{\theta' \in \Theta} L(u, \theta') + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_h(u_t \,\|\, u_{t+1}),$$
as was to be shown. (The first inequality uses convexity of $L$ in $u$.)

Proof of Theorem 1. By Lemma 6, we have
$$D_h(u_{t+1} \,\|\, u_{t+2}) = D_{h^*}(-\hat{\theta}_{t+1} \,\|\, -\hat{\theta}_t) \le \frac{1}{2} \|\hat{\theta}_{t+1} - \hat{\theta}_t\|_*^2.$$
Moreover,
$$\hat{\theta}_{t+1} - \hat{\theta}_t = \frac{\sum_{s \le t+1} \alpha_s \theta_s}{A_{t+1}} - \frac{\sum_{s \le t} \alpha_s \theta_s}{A_t} = \frac{\alpha_{t+1}}{A_{t+1}} \theta_{t+1} - \frac{\alpha_{t+1}}{A_t A_{t+1}} \sum_{s \le t} \alpha_s \theta_s = \frac{\alpha_{t+1}}{A_{t+1}} \big( \theta_{t+1} - \hat{\theta}_t \big),$$
so that $\|\hat{\theta}_{t+1} - \hat{\theta}_t\|_* \le \frac{r \alpha_{t+1}}{A_{t+1}}$. It follows that
$$\sum_{t=1}^{T} A_t D_h(u_{t+1} \,\|\, u_{t+2}) \le 2r^2 \sum_{t=1}^{T} \frac{\alpha_{t+1}^2 A_t}{A_{t+1}^2},$$
which together with Proposition 7 yields the theorem.
Proof of Corollary 2. If we let $\alpha_t = 1$, then $A_t = t$ and $\frac{\alpha_{t+1}^2 A_t}{A_{t+1}^2} = \frac{t}{(t+1)^2} \le \frac{1}{t}$. We therefore get
$$\frac{2r^2}{A_T} \sum_{t=1}^{T} \frac{\alpha_{t+1}^2 A_t}{A_{t+1}^2} \le \frac{2r^2}{T} \sum_{t=1}^{T} \frac{1}{t} \le \frac{2r^2 (\log(T) + 1)}{T}. \tag{27}$$
If we let $\alpha_t = t$, then $A_t = \frac{t(t+1)}{2}$ and
$$\frac{\alpha_{t+1}^2 A_t}{A_{t+1}^2} = \frac{(t+1)^2 \cdot t(t+1)/2}{\big( (t+1)(t+2)/2 \big)^2} = \frac{2t(t+1)}{(t+2)^2} \le 2.$$
We therefore get
$$\frac{2r^2}{A_T} \sum_{t=1}^{T} \frac{\alpha_{t+1}^2 A_t}{A_{t+1}^2} \le \frac{4r^2}{T(T+1)} \sum_{t=1}^{T} 2 = \frac{8r^2}{T+1}, \tag{28}$$
which completes the proof.

The proof of Theorem 3 requires an analogous supporting proposition to that of Theorem 1.

Proposition 8 (Convergence of DBMD). Consider the updates $u_t \in \arg\min_u h(u) + \langle \theta_t, u \rangle$ and $\theta_{t+1} \in \arg\max_\theta \langle \theta, \hat{u}_t \rangle - R(\theta)$. Then we have
$$\sup_{\theta \in \Theta} L(\hat{u}_T, \theta) \le \sup_{\theta \in \Theta} L(u, \theta) + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_R(\theta_t \,\|\, \theta_{t+1}). \tag{29}$$

Proof. If we invoke Lemma 5 with $f = R$ and $z_t = -u_t$, then for all $\theta$ we get the inequality
$$\frac{1}{A_T} \sum_{t=1}^{T} \alpha_t \big( R(\theta_t) - \langle u_t, \theta_t \rangle \big) \le R(\theta) - \langle \hat{u}_T, \theta \rangle + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_R(\theta_t \,\|\, \theta_{t+1}). \tag{30}$$
Re-arranging yields
$$\frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u_t, \theta_t) \ge \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u_t, \theta) - \frac{1}{A_T} \sum_{t=1}^{T} A_t D_R(\theta_t \,\|\, \theta_{t+1}). \tag{31, 32}$$
Now, we have the following string of inequalities, valid for any $\theta \in \Theta$:
$$L(\hat{u}_T, \theta) = L\Big( \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t u_t, \; \theta \Big) \le \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u_t, \theta) \le \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u_t, \theta_t) + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_R(\theta_t \,\|\, \theta_{t+1})$$
$$= \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t \inf_u L(u, \theta_t) + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_R(\theta_t \,\|\, \theta_{t+1}) \le \frac{1}{A_T} \sum_{t=1}^{T} \alpha_t L(u, \theta_t) + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_R(\theta_t \,\|\, \theta_{t+1})$$
$$\le \sup_{\theta' \in \Theta} L(u, \theta') + \frac{1}{A_T} \sum_{t=1}^{T} A_t D_R(\theta_t \,\|\, \theta_{t+1}),$$
as was to be shown.

Proofs of Theorem 3 and Corollary 4. The proofs are identical to those for PBMD.
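As a quick numerical sanity check of Corollary 2 (our addition; the toy objective and random seed are arbitrary), one can run the PBMD subgradient variant with both weight schedules and confirm that the optimality gap shrinks at roughly a $1/T$ rate, up to the logarithmic factor for $\alpha_t = 1$:

```python
import numpy as np

# Toy problem: f(x) = (1/2)||x||^2 + ||x - b||_1, i.e. gamma = 1 and
# f0(x) = ||x - b||_1; the coordinatewise minimizer is x* = clip(b, -1, 1).
rng = np.random.default_rng(0)
b = 3.0 * rng.normal(size=20)
grad_f0 = lambda x: np.sign(x - b)                  # a valid subgradient of f0
f = lambda x: 0.5 * x @ x + np.abs(x - b).sum()
f_star = f(np.clip(b, -1.0, 1.0))

for label, weight in (("alpha_t = 1", lambda t: 1.0),
                      ("alpha_t = t", lambda t: float(t))):
    u = np.zeros_like(b)                            # u_1 = 0
    grad_sum = np.zeros_like(b)
    iterate_sum = np.zeros_like(b)
    weight_sum = 0.0
    for t in range(1, 1001):
        a = weight(t)
        iterate_sum += a * u
        grad_sum += a * grad_f0(u)
        weight_sum += a
        u = -grad_sum / weight_sum                  # gamma = 1
    u_hat = iterate_sum / weight_sum
    print(label, "optimality gap:", f(u_hat) - f_star)
```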