How hard is this function to optimize?

Size: px
Start display at page:

Download "How hard is this function to optimize?"

Transcription

1 How hard is this function to optimize? John Duchi Based on joint work with Sabyasachi Chatterjee, John Lafferty, Yuancheng Zhu Stanford University West Coast Optimization Rumble October 2016

2 Problem minimize f(x) := E[F(x;ξ)] subject to x X. (1) where x F(x;ξ) is convex, X is closed convex set

3 Problem minimize f(x) := E[F(x;ξ)] subject to x X. (1) where x F(x;ξ) is convex, X is closed convex set Two questions: 1. How hard is it to solve problem (1) for that f 2. Can an algorithm(s) do as well as possible for each f?

4 Outline Part 0 Complexity of problems Part I Complexity lower bounds General lower bounds Super-efficiency Part II Toward achievability? Part III Problem geometry and dimensionality

5 Problem Complexity Large literature on guarantees of optimality Wald 1939, Contributions to the theory of statistical estimation and testing hypotheses (minimax complexity) Nemirovski and Yudin 1983, Problem Complexity and Method Efficiency in Optimization (information-based complexity)

6 Problem Complexity Large literature on guarantees of optimality Wald 1939, Contributions to the theory of statistical estimation and testing hypotheses (minimax complexity) Nemirovski and Yudin 1983, Problem Complexity and Method Efficiency in Optimization (information-based complexity) Three main considerations: i. Information oracle (how we get information on problem) ii. Problem class (what problems must the algorithm solve) iii. How to measure error

7 Minimax error and oracles Information oracle How algorithm/procedure receives information about problem Optimization: function value f(x) (zero-order), gradient f(x) (first-order), Hessian 2 f(x) (second-order) Statistics: observations ξi from probability distribution P

8 Minimax error and oracles Information oracle How algorithm/procedure receives information about problem Optimization: function value f(x) (zero-order), gradient f(x) (first-order), Hessian 2 f(x) (second-order) Statistics: observations ξi from probability distribution P Minimax principle Develop algorithm that has best worst case performance (risk) for problem class F R N (F) := inf sup A A N f F error(a,f) where A N is algorithms with N queries, F is problem class

9 Example F consist of 1-Lipschitz convex functions on R d Oracle returns f(x)+ε, where E[ε] = 0 and ε 1 Domain X containts l -ball

10 Example F consist of 1-Lipschitz convex functions on R d Oracle returns f(x)+ε, where E[ε] = 0 and ε 1 Domain X containts l -ball Minimax rate: let x N be output of algorithm A after N steps d inf E[f( x N ) f(x )]. N sup A A N f F (Agarwal et al. 12, Nemirovski & Yudin 83)

11 Example F consist of 1-Lipschitz convex functions on R d Oracle returns f(x)+ε, where E[ε] = 0 and ε 1 Domain X containts l -ball Minimax rate: let x N be output of algorithm A after N steps d inf E[f( x N ) f(x )]. N sup A A N f F (Agarwal et al. 12, Nemirovski & Yudin 83) More generally optimality guarantee: L-Lipschitz convex functions on sets X with diameter D, minimax rate LD/ N

12 An optimal? algorithm Algorithm: At iteration t Choose random ξ, set g t = F(x t ;ξ i ) Update x t+1 = x t α t g t

13 An optimal? algorithm Algorithm: At iteration t Choose random ξ, set g t = F(x t ;ξ i ) Update x t+1 = x t α t g t

14 An optimal? algorithm Algorithm: At iteration t Choose random ξ, set g t = F(x t ;ξ i ) Update x t+1 = x t α t g t Theorem (Russians / Hungarians): Let x N = 1 N N t=1 x t and assume D x x 1 2, L 2 E[ g t 2 2 ]. Then E[f( x N ) f(x ) LD N

15 Stochastic gradient descent Let s use stochastic gradient descent to solve minimize x f(x) = 1 2 x2 subject to x [ 1,1]

16 Stochastic gradient descent Let s use stochastic gradient descent to solve minimize x f(x) = 1 2 x2 subject to x [ 1,1] / t -steps t

17 Stochastic gradient descent Let s use stochastic gradient descent to solve minimize x f(x) = 1 2 x2 subject to x [ 1,1] t -steps 1/t-steps 1/ 1/t t

18 A local notion of complexity Minimax complexity R N (F) := inf sup A A N f F error(a,f)

19 A local notion of complexity Minimax complexity R N (F) := inf sup A A N f F error(a,f) Local minimax complexity: Fix function f, and look for hardest local alternative R N (f;f) := sup g F inf max{error(a,f),error(a,g)} A A N (Related ideas in statistics: Donoho & Liu 1987, 1991, Cai & Low 2015)

20 Local minimax complexity for stochastic optimization Noisy subgradient oracle: ξ N(0,σ 2 I d d ), return f (x)+ξ Function class F: convex functions (can restrict) Algorithm A N : all algorithms with N noisy subgradient queries Error metric err : X F R

21 Local minimax complexity for stochastic optimization Noisy subgradient oracle: ξ N(0,σ 2 I d d ), return f (x)+ξ Function class F: convex functions (can restrict) Algorithm A N : all algorithms with N noisy subgradient queries Error metric err : X F R Local minimax complexity ( x A is output of algorithm) R N (f;f) := sup inf max{e f [err( x A,f)],E g [err( x A,g)]}. g F A A N

22 Distances on functions and moduli of continuity Distance for solutions Error must to satisfy exclusion inequality err(x,f) d(f,g) implies err(x,g) d(f,g). Example: how well one or other can be optimized { f d(f 0,f 1 ) := sup δ 0 : 1 (x) f1 +δ implies f 0 (x) f0 +δ f 0 (x) f0 +δ implies f 1 (x) f1 +δ error err(x,f) = f(x) f },

23 Separation f 0 f 1 δ = d(f 0,f 1 ) x : f 1 (x) f 1 +δ

24 Other error metrics Distance for solutions Error must to satisfy exclusion inequality err(x,f) d(f,g) implies err(x,g) d(f,g). Example: distance to optimality X f := argmin x X f(x), Xg := argming(x) x X d(f,g) = dist(x f,x g)

25 Distances on functions We study first-order methods, so define κ(f,g) := sup f (x) g (x) x X

26 Modulus of continuity Given solution metric d and function metric κ, modulus of continuity of d with respect to κ at f is ω f (ǫ) := sup{d(f,g) : κ(f,g) ǫ} g

27 Modulus of continuity Given solution metric d and function metric κ, modulus of continuity of d with respect to κ at f is Examples: with d(f,g) = x f x g ω f (ǫ) := sup{d(f,g) : κ(f,g) ǫ} g

28 Modulus of continuity Given solution metric d and function metric κ, modulus of continuity of d with respect to κ at f is Examples: with d(f,g) = x f x g ω f (ǫ) := sup{d(f,g) : κ(f,g) ǫ} g Quadratic f(x) = 1 2 x2, ω f (ǫ) = ǫ

29 Modulus of continuity Given solution metric d and function metric κ, modulus of continuity of d with respect to κ at f is Examples: with d(f,g) = x f x g ω f (ǫ) := sup{d(f,g) : κ(f,g) ǫ} g Quadratic f(x) = 1 2 x2, ω f (ǫ) = ǫ Power f(x) = 1 k x k, ω f (ǫ) = ǫ 1 k 1

30 Illustration of modulus of continuity If X R, { ω f (ǫ) = sup x x f : f (x) ǫ } x

31 Main Theorem I Theorem (Chatterjee, D., Lafferty, Zhu): Under Gaussian σ 2 -noise, for (almost) any f ( ) ( ) σ σ cω f R N (f;f) Cω f. N N

32 Main Theorem I Theorem (Chatterjee, D., Lafferty, Zhu): Under Gaussian σ 2 -noise, for (almost) any f ( ) ( ) σ σ cω f R N (f;f) Cω f. N N Modulus of continuity precisely determines rate of convergence

33 Proof intuition If P f = distribution when f is true, P g = distribution when g is true max{e f [err( x,f)],e g [err( x,g)]} 1 2 E f[err( x,f)]+ 1 2 E g[err( x,g)]

34 Proof intuition If P f = distribution when f is true, P g = distribution when g is true max{e f [err( x,f)],e g [err( x,g)]} d(f,g) 2 [1+P g (err( x,f) d(f,g)) P f (err( x,g) d(f,g))]

35 Is this real? cω f ( σ N ) R N (f;f) Cω f ( σ N ).

36 Is this real? ( ) ( ) σ σ cω f R N (f;f) Cω f. N N Is it possible to have a faster algorithm? (Were we too adversarial?) Is it too easy? (Were we not adversarial enough?)

37 You probably cannot be faster Theorem (Chatterjee, D., Lafferty, Zhu): Let an algorithm x N satisfy ( ) σ E f [err( x N,f)] δω f. N Then for ǫ N = σ log 1 δ N and g one of g 1 (x) = f(x) ǫ N x, g 1 (x) = f(x)+ǫ N x, E g [err( x N,g)] cω g σ log 1 δ N.

38 Is the local problem too easy? Not in 1-dimension! Sign-based binary search From interval [a 0,b 0 ], for k = 1,2,... Query T points at x k = 1 2 (a+b) for gradients G 1,...,G T If T t=1 G t > 0, set b = x k otherwise a = x k Proposition (Chatterjee, D., Lafferty, Zhu): After k epochs, define { } I k := x : f (x) σ log(k/δ) T Then w.p. 1 δ, dist(x k,i k ) 2 k b 0 a 0 Õ(1)ω f ( ) σ N

39 Intuition Recall flat set {x : f (x) ǫ}

40 Simulations risk risk 5e-06 5e-05 5e-04 5e-03 risk risk k = t k l = 1.5, k r = 2 risk risk risk risk k = t k l = 1.5, k r = 3 risk risk risk risk k = t k l = 2, k r

41 More achievability Problem: In higher dimensions, we have some issues

42 More achievability Problem: In higher dimensions, we have some issues Note: an epoch-based stochastic gradient descent procedure of Juditsky and Nesterov (2014) is nearly adaptive (result coming soon)

43 More achievability Problem: In higher dimensions, we have some issues Note: an epoch-based stochastic gradient descent procedure of Juditsky and Nesterov (2014) is nearly adaptive (result coming soon) Intuition: for k = 1,2,..., Perform stochastic gradient descent for T k 2 k iterations with constant stepsize α k Halve stepsize, double iteration length T k+1 = 2T k, run again

44 More achievability Problem: In higher dimensions, we have some issues Note: an epoch-based stochastic gradient descent procedure of Juditsky and Nesterov (2014) is nearly adaptive (result coming soon) Intuition: for k = 1,2,..., Perform stochastic gradient descent for T k 2 k iterations with constant stepsize α k Halve stepsize, double iteration length T k+1 = 2T k, run again One of these will be correct stepsize, and everything later will not allow any movement anyway

45 Part III: Problem Geometry Coming soon... just a handwritten teaser if time.

46 Summary Local notions of minimax risk Worst single alternative Shrinking neighborhoods (suitably or poorly) defined Some optimal algorithms, but work to be done! Some here: more soon...

Local Minimax Complexity of Stochastic Convex Optimization

Local Minimax Complexity of Stochastic Convex Optimization Local Minimax Complexity of Stochastic Convex Optimization Yuancheng Zhu Sabyasachi Chatterjee John Duchi and John Lafferty arxiv:1605.07596v3 [stat.ml] 6 May 016 Department of Statistics Department of

More information

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f

More information

Ergodic Subgradient Descent

Ergodic Subgradient Descent Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

Randomized Smoothing Techniques in Optimization

Randomized Smoothing Techniques in Optimization Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter Bartlett, Michael Jordan, Martin Wainwright, Andre Wibisono Stanford University Information Systems Laboratory

More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Estimators based on non-convex programs: Statistical and computational guarantees

Estimators based on non-convex programs: Statistical and computational guarantees Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright

More information

Composite nonlinear models at scale

Composite nonlinear models at scale Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)

More information

Minimax rates for Batched Stochastic Optimization

Minimax rates for Batched Stochastic Optimization Minimax rates for Batched Stochastic Optimization John Duchi based on joint work with Feng Ruan and Chulhee Yun Stanford University Tradeoffs Major problem in theoretical statistics: how do we characterize

More information

Op#mal convex op#miza#on under Tsybakov noise through connec#ons to ac#ve learning

Op#mal convex op#miza#on under Tsybakov noise through connec#ons to ac#ve learning Op#mal convex op#miza#on under Tsybakov noise through connec#ons to ac#ve learning Aar$ Singh Joint work with: Aaditya Ramdas Connec#ons between convex op#miza#on and ac#ve learning (a formal reduc#on)

More information

Lecture 2: Subgradient Methods

Lecture 2: Subgradient Methods Lecture 2: Subgradient Methods John C. Duchi Stanford University Park City Mathematics Institute 206 Abstract In this lecture, we discuss first order methods for the minimization of convex functions. We

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Information theoretic perspectives on learning algorithms

Information theoretic perspectives on learning algorithms Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Lecture 25: Subgradient Method and Bundle Methods April 24

Lecture 25: Subgradient Method and Bundle Methods April 24 IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything

More information

10. Ellipsoid method

10. Ellipsoid method 10. Ellipsoid method EE236C (Spring 2008-09) ellipsoid method convergence proof inequality constraints 10 1 Ellipsoid method history developed by Shor, Nemirovski, Yudin in 1970s used in 1979 by Khachian

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Non-convex optimization. Issam Laradji

Non-convex optimization. Issam Laradji Non-convex optimization Issam Laradji Strongly Convex Objective function f(x) x Strongly Convex Objective function Assumptions Gradient Lipschitz continuous f(x) Strongly convex x Strongly Convex Objective

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Query Complexity of Derivative-Free Optimization

Query Complexity of Derivative-Free Optimization Query Complexity of Derivative-Free Optimization Kevin G Jamieson University of Wisconsin Madison, WI 53706, USA kgjamieson@wiscedu Robert D Nowak University of Wisconsin Madison, WI 53706, USA nowak@engrwiscedu

More information

The Inverse Function Theorem via Newton s Method. Michael Taylor

The Inverse Function Theorem via Newton s Method. Michael Taylor The Inverse Function Theorem via Newton s Method Michael Taylor We aim to prove the following result, known as the inverse function theorem. Theorem 1. Let F be a C k map (k 1) from a neighborhood of p

More information

LEARNING IN CONCAVE GAMES

LEARNING IN CONCAVE GAMES LEARNING IN CONCAVE GAMES P. Mertikopoulos French National Center for Scientific Research (CNRS) Laboratoire d Informatique de Grenoble GSBE ETBC seminar Maastricht, October 22, 2015 Motivation and Preliminaries

More information

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology February 2014

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Lecture 9: October 25, Lower bounds for minimax rates via multiple hypotheses

Lecture 9: October 25, Lower bounds for minimax rates via multiple hypotheses Information and Coding Theory Autumn 07 Lecturer: Madhur Tulsiani Lecture 9: October 5, 07 Lower bounds for minimax rates via multiple hypotheses In this lecture, we extend the ideas from the previous

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Estimation, Optimization, and Parallelism when Data is Sparse

Estimation, Optimization, and Parallelism when Data is Sparse Estimation, Optimization, and Parallelism when Data is Sparse John C. Duchi, Michael I. Jordan University of California, Berkeley Berkeley, CA 9470 {jduchi,jordan}@eecs.berkeley.edu H. Brendan McMahan

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

Accelerate Subgradient Methods

Accelerate Subgradient Methods Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Expanding the reach of optimal methods

Expanding the reach of optimal methods Expanding the reach of optimal methods Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW) BURKAPALOOZA! WCOM

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Stochastic Semi-Proximal Mirror-Prox

Stochastic Semi-Proximal Mirror-Prox Stochastic Semi-Proximal Mirror-Prox Niao He Georgia Institute of echnology nhe6@gatech.edu Zaid Harchaoui NYU, Inria firstname.lastname@nyu.edu Abstract We present a direct extension of the Semi-Proximal

More information

arxiv: v1 [stat.ml] 12 Nov 2015

arxiv: v1 [stat.ml] 12 Nov 2015 Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints Mengdi Wang, Yichen Chen, Jialin Liu, Yuantao Gu arxiv:5.03760v [stat.ml] Nov 05 November 3,

More information

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Yi Zhou Department of ECE The Ohio State University zhou.1172@osu.edu Zhe Wang Department of ECE The Ohio State University

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

5. Subgradient method

5. Subgradient method L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Accelerating Nesterov s Method for Strongly Convex Functions

Accelerating Nesterov s Method for Strongly Convex Functions Accelerating Nesterov s Method for Strongly Convex Functions Hao Chen Xiangrui Meng MATH301, 2011 Outline The Gap 1 The Gap 2 3 Outline The Gap 1 The Gap 2 3 Our talk begins with a tiny gap For any x 0

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Noname manuscript No. (will be inserted by the editor) Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Saeed Ghadimi Guanghui Lan Hongchao Zhang the date of

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

Large-scale Machine Learning and Optimization

Large-scale Machine Learning and Optimization 1/84 Large-scale Machine Learning and Optimization Zaiwen Wen http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Dimitris Papailiopoulos and Shiqian Ma lecture notes

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

More information

The Steepest Descent Algorithm for Unconstrained Optimization

The Steepest Descent Algorithm for Unconstrained Optimization The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Second-Order Methods for Stochastic Optimization

Second-Order Methods for Stochastic Optimization Second-Order Methods for Stochastic Optimization Frank E. Curtis, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Computational Learning Theory - Hilary Term : Learning Real-valued Functions

Computational Learning Theory - Hilary Term : Learning Real-valued Functions Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling

More information

arxiv: v2 [math.oc] 8 Oct 2011

arxiv: v2 [math.oc] 8 Oct 2011 Stochastic convex optimization with bandit feedback Alekh Agarwal Dean P. Foster Daniel Hsu Sham M. Kakade, Alexander Rakhlin Department of EECS Department of Statistics Microsoft Research University of

More information

O(1/ɛ) Lower Bounds for Convex Optimization with Stochastic Oracles

O(1/ɛ) Lower Bounds for Convex Optimization with Stochastic Oracles Technical Note July 8, 200 O(/ɛ) Lower Bounds for Convex Optimization with Stochastic Oracles Yaoliang Yu Xinhua Zhang yaoliang@cs.ualberta.ca xinhua.zhang.cs@gmail.com Department of Computing Science

More information

GRADIENT = STEEPEST DESCENT

GRADIENT = STEEPEST DESCENT GRADIENT METHODS GRADIENT = STEEPEST DESCENT Convex Function Iso-contours gradient 0.5 0.4 4 2 0 8 0.3 0.2 0. 0 0. negative gradient 6 0.2 4 0.3 2.5 0.5 0 0.5 0.5 0 0.5 0.4 0.5.5 0.5 0 0.5 GRADIENT DESCENT

More information

Hamiltonian Descent Methods

Hamiltonian Descent Methods Hamiltonian Descent Methods Chris J. Maddison 1,2 with Daniel Paulin 1, Yee Whye Teh 1,2, Brendan O Donoghue 2, Arnaud Doucet 1 Department of Statistics University of Oxford 1 DeepMind London, UK 2 The

More information

Tsybakov noise adap/ve margin- based ac/ve learning

Tsybakov noise adap/ve margin- based ac/ve learning Tsybakov noise adap/ve margin- based ac/ve learning Aar$ Singh A. Nico Habermann Associate Professor NIPS workshop on Learning Faster from Easy Data II Dec 11, 2015 Passive Learning Ac/ve Learning (X j,?)

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Optimization, Learning, and Games with Predictable Sequences

Optimization, Learning, and Games with Predictable Sequences Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

Bandit Online Convex Optimization

Bandit Online Convex Optimization March 31, 2015 Outline 1 OCO vs Bandit OCO 2 Gradient Estimates 3 Oblivious Adversary 4 Reshaping for Improved Rates 5 Adaptive Adversary 6 Concluding Remarks Review of (Online) Convex Optimization Set-up

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by:

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by: Newton s Method Suppose we want to solve: (P:) min f (x) At x = x, f (x) can be approximated by: n x R. f (x) h(x) := f ( x)+ f ( x) T (x x)+ (x x) t H ( x)(x x), 2 which is the quadratic Taylor expansion

More information

Convex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University

Convex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University Convex Optimization 9. Unconstrained minimization Prof. Ying Cui Department of Electrical Engineering Shanghai Jiao Tong University 2017 Autumn Semester SJTU Ying Cui 1 / 40 Outline Unconstrained minimization

More information

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Primal-dual Subgradient Method for Convex Problems with Functional Constraints Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual

More information

arxiv: v2 [math.oc] 1 Nov 2017

arxiv: v2 [math.oc] 1 Nov 2017 Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence arxiv:1710.09447v [math.oc] 1 Nov 017 Mingrui Liu, Tianbao Yang Department of Computer Science The University of

More information

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Statistical Sparse Online Regression: A Diffusion Approximation Perspective

Statistical Sparse Online Regression: A Diffusion Approximation Perspective Statistical Sparse Online Regression: A Diffusion Approximation Perspective Jianqing Fan Wenyan Gong Chris Junchi Li Qiang Sun Princeton University Princeton University Princeton University University

More information

10. Unconstrained minimization

10. Unconstrained minimization Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

A New Look at the Performance Analysis of First-Order Methods

A New Look at the Performance Analysis of First-Order Methods A New Look at the Performance Analysis of First-Order Methods Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with Yoel Drori, Google s R&D Center, Tel Aviv Optimization without

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems 216 IEEE 55th Conference on Decision and Control (CDC) ARIA Resort & Casino December 12-14, 216, Las Vegas, USA Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information