Hamiltonian Descent Methods
1 Hamiltonian Descent Methods. Chris J. Maddison^{1,2}, with Daniel Paulin^1, Yee Whye Teh^{1,2}, Brendan O'Donoghue^2, Arnaud Doucet^1. ^1 Department of Statistics, University of Oxford; ^2 DeepMind, London, UK.
2 The problem. Unconstrained minimization of a differentiable f : R^d → R,
x* = arg min_{x ∈ R^d} f(x).
This talk: convex f. Paper: also briefly considers non-convex f.
3 Optimization and Machine Learning. There is an imbalance in our pipelines: time is spent designing models, but success is constrained by the optimizer. Have we discovered all the useful optimizers? If there's any doubt that optimization is a bottleneck for neural nets, consider how many architectural innovations were ways to get SGD to work better.
4 Optimization and Computer Science. The computational complexity classes of convex optimization are characterized by the information required of f [7]. Local black-box evaluation of:
0th-order: f(x)
1st-order: f(x), ∇f(x) = (∂f(x)/∂x^(n))
2nd-order: f(x), ∇f(x), ∇²f(x) = (∂²f(x)/∂x^(n)∂x^(m))
5 Optimization and Computer Science. Study the rate of convergence of iterative methods. [Plot: log(f(x_i) − f(x*)) against iteration i, with sub-linear, linear, and super-linear curves.] Distinguish between fast linear and slow sub-linear convergence.
6 Gradient descent. E.g. gradient descent is a first-order method. Iterates with step size ε > 0:
x_{i+1} = x_i − ε∇f(x_i).
[Plot: gradient descent iterates x_0, x_1, ... on level sets in the (x^(1), x^(2)) plane.]
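The update on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not from the talk: the quadratic objective, step size, and iteration count are ad hoc choices.

```python
import numpy as np

# Sketch of the fixed-step gradient descent iteration from the slide,
#   x_{i+1} = x_i - eps * grad_f(x_i),
# on an illustrative quadratic f(x) = 0.5 * x^T A x (A and eps are
# hypothetical choices, not from the talk).

def gradient_descent(grad_f, x0, eps, n_iters):
    """Run fixed-step gradient descent and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - eps * grad_f(x)
    return x

A = np.array([[2.0, 0.0], [0.0, 10.0]])  # mildly ill-conditioned quadratic
grad_f = lambda x: A @ x                 # gradient of f(x) = 0.5 x^T A x
x_final = gradient_descent(grad_f, x0=[1.0, 1.0], eps=0.1, n_iters=200)
```

With eps = 1/L = 0.1 (L = 10 being the largest eigenvalue of A), the iterates contract geometrically toward the minimizer at the origin.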
7 When is gradient descent fast? f ∈ C² is strongly convex & smooth iff there exist µ, L > 0 such that for all x ∈ R^d,
µI ⪯ ∇²f(x) ⪯ LI.
Gradient descent on smooth & strongly convex f with ε = 1/L has fast linear convergence:
f(x_i) − f(x*) ≤ O((1 − µ/L)^i).
8 Smoothness & strong convexity important? Lower bound: Nemirovski & Yudin [7] show that for every iterative first-order method and iteration i, there is a smooth convex f such that convergence is slow,
f(x_i) − f(x*) ≥ Ω(i^{−2}).
A similar result holds for non-smooth strongly convex f.
9 Summary so far. ∇²f(x) bounded above and below by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. [Plot: f(x) squeezed between quadratics of curvature µ and L.]
10 Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
11 Power functions. Power functions are useful as a study case of idealized convex functions:
f(x) = |x|^b / b, b ≥ 1, x ∈ R.
[Plot: f(x) for b = 4/3, b = 2, b = 4.]
12 Power functions. Smooth & strongly convex iff b = 2. Łojasiewicz inequality [4]: real analytic functions can be bounded by power functions at their zero locus. If f : R^d → R is real analytic & convex with unique minimum x*, then for every compact K ⊆ R^d there exist b ≥ 1 and µ > 0 such that for all x ∈ K,
f(x) − f(x*) ≥ (µ/b)‖x − x*‖₂^b.
In general, we don't know b.
13 Continuous limit of optimization algorithms. To study properties of optimizers, consider ε → 0. [Plot: discrete iterates in the (x^(1), x^(2)) plane approaching a continuous path as ε shrinks.] E.g. gradient descent iterates approximate the solution to gradient flow,
ẋ_t = −∇f(x_t), with x_0 ∈ R^d, t ≥ 0.
14 Continuous limit of optimization algorithms. Fundamental properties are revealed by studying solutions x_t : [0, ∞) → R^d of gradient flow; e.g. it is a descent method:
d/dt f(x_t) = ⟨∇f(x_t), ẋ_t⟩ = −‖∇f(x_t)‖² ≤ 0.
15 Gradient descent on power functions. For f(x) = |x|^b / b, we have d/dt f(x_t) = −b|x_t|^{b−2} f(x_t), so
f(x_t) = f(x_0) exp(−b ∫₀ᵗ |x_s|^{b−2} ds).
Two regimes in b for the rate of convergence:
1 < b ≤ 2: f(x_t) ≤ O(exp(−λt))
b > 2: f(x_t) ≥ Ω(t^{−b/(b−2)})
16 Gradient descent on power functions. [Plot: continuous-time gradient descent on f(x) = |x|^b / b; log f(x_t) against time t for b = 4/3, b = 2, b = 4.]
17 Gradient descent on power functions. Gradient descent with step size ε > 0,
x_{i+1} = x_i (1 − ε|x_i|^{b−2}),
doesn't converge for b < 2, as |x_i|^{b−2} explodes near the minimum. [Plot: log f(x_i) against iteration i for b = 4/3, b = 2, b = 4.]
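The failure described on this slide is easy to reproduce numerically. A minimal sketch (parameter values are ad hoc, not from the talk): for b = 2 the iterates contract geometrically, while for b = 4/3 the factor |x_i|^{b−2} blows up near zero and the iterates stall at a step-size-dependent distance from the minimum.

```python
import numpy as np

# Fixed-step gradient descent on f(x) = |x|^b / b in one dimension,
#   x_{i+1} = x_i * (1 - eps * |x_i|^(b-2)).

def power_gd(b, x0=1.0, eps=0.1, n_iters=2000):
    """Return the iterate trajectory as an array."""
    xs = [x0]
    x = x0
    for _ in range(n_iters):
        if x == 0.0:          # guard: 0**negative would raise for b < 2
            break
        x = x * (1.0 - eps * abs(x) ** (b - 2.0))
        xs.append(x)
    return np.array(xs)

xs_b2 = power_gd(b=2.0)        # b = 2: contracts by (1 - eps) each step
xs_b43 = power_gd(b=4.0 / 3.0) # b = 4/3: oscillates, never reaches 0
```

For b = 4/3 the iterates end up bouncing in sign around a band of magnitude roughly (eps/2)^{3/2}, matching the slide's claim of non-convergence.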
18 Gradient descent on power functions. Summary of gradient descent with fixed ε on power functions:
1 < b < 2: super-linear in continuous time, but the fixed-step discretization diverges.
b = 2: linear in discrete time.
b > 2: sub-linear in continuous time.
This mirrors the lower bounds, although specialized methods can do better in this case.
19 Summary so far. ∇²f(x) bounded by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. Power functions serve as a sandbox test case for optimization; their behavior mirrors the lower bound results.
20 Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
21 The question. What can be done using the first-order computation of two functions f, k : R^d → R?
f(x), ∇f(x), k(p), ∇k(p).
k(p), ∇k(p) must be cheap to compute (e.g., O(d)) to avoid cheating.
22 Proposed methods & key contributions. Our methods generalize the momentum method [10] to include a non-standard kinetic energy k; we call them Hamiltonian descent methods. Linear rates are possible for convex functions f that are not smooth & strongly convex. Convergence theory in continuous & discrete time.
23 Gradient descent with momentum. Polyak's heavy ball [10] with ε, γ > 0:
p_{i+1} = −ε∇f(x_i) + (1 − εγ)p_i
x_{i+1} = x_i + εp_{i+1}
Persistent motion helps in narrow valleys. [Plot: heavy ball vs. gradient descent iterates in a narrow valley in the (x^(1), x^(2)) plane.]
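The two heavy-ball updates above translate directly into code. A minimal sketch on a hypothetical narrow quadratic valley (the objective and all parameter values are illustrative choices, not from the talk):

```python
import numpy as np

# Polyak's heavy-ball iteration from the slide:
#   p_{i+1} = -eps * grad_f(x_i) + (1 - eps*gamma) * p_i
#   x_{i+1} = x_i + eps * p_{i+1}
# run on f(x) = 0.5 * (x1^2 + 25 * x2^2), a narrow quadratic valley.

def heavy_ball(grad_f, x0, eps, gamma, n_iters):
    """Run the heavy-ball method and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    p = np.zeros_like(x)
    for _ in range(n_iters):
        p = -eps * grad_f(x) + (1.0 - eps * gamma) * p
        x = x + eps * p
    return x

grad_f = lambda x: np.array([x[0], 25.0 * x[1]])
x_hb = heavy_ball(grad_f, x0=[1.0, 1.0], eps=0.2, gamma=2.0, n_iters=500)
```

The momentum term (1 − εγ)p_i carries velocity across iterations, which is what lets the iterates keep moving along the valley floor while oscillations across it are damped.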
24 Gradient descent with momentum. Continuous ε → 0 limit of Polyak's heavy ball.
Polyak's heavy ball: p_{i+1} = −ε∇f(x_i) + (1 − εγ)p_i; x_{i+1} = x_i + εp_{i+1}.
Continuous heavy ball: ẋ_t = p_t; ṗ_t = −∇f(x_t) − γp_t.
25 Hamiltonian descent methods. Generalize the position update of Polyak's heavy ball.
Continuous heavy ball: ẋ_t = p_t; ṗ_t = −∇f(x_t) − γp_t.
Continuous Hamiltonian descent: ẋ_t = ∇k(p_t); ṗ_t = −∇f(x_t) − γp_t.
Also called a conformal Hamiltonian system [6].
26 Hamiltonian descent methods. Def. In physics the total energy or Hamiltonian is defined as
H(x, p) = f(x) − f(x*) + k(p).
If k is strictly convex with minimum k(0) = 0, then solutions of conformal Hamiltonian systems descend the Hamiltonian:
d/dt H(x_t, p_t) = −γ⟨∇k(p_t), p_t⟩ ≤ 0.
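The descent property on this slide can be checked numerically. A minimal sketch, with illustrative choices not from the talk: f(x) = x⁴/4 (so f(x*) = 0) and quadratic k(p) = p²/2, integrated with small explicit Euler steps as a stand-in for the exact flow (Euler only approximately preserves monotonicity, hence the small tolerance below).

```python
import numpy as np

# Integrate the conformal Hamiltonian system from the slide,
#   x' = grad_k(p),  p' = -grad_f(x) - gamma * p,
# and record H(x, p) = f(x) + k(p) along the trajectory.

gamma, dt = 1.0, 1e-4
f = lambda x: x ** 4 / 4.0
k = lambda p: p ** 2 / 2.0
grad_f = lambda x: x ** 3
grad_k = lambda p: p

x, p = 2.0, 0.0
H_vals = []
for _ in range(50_000):
    H_vals.append(f(x) + k(p))
    x, p = x + dt * grad_k(p), p + dt * (-grad_f(x) - gamma * p)
H_vals = np.array(H_vals)
```

H decreases along the whole trajectory (up to Euler discretization error), even though this particular pairing (a = 2, b = 4) only yields a sub-linear rate per the later slides.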
27 Hamiltonian descent methods. Dual views on the k and f relationship: Given f, design k for fast convergence? Given k, on which class of f is convergence fast? That the class of smooth & strongly convex f corresponds to quadratic k(p) = ⟨p, p⟩/2 is not an accident! Develop intuition via one-dimensional power functions. Let ϕ_a(t) = t^a / a, and take
f(x) = ϕ_b(|x|), k(p) = ϕ_a(|p|).
28 Hamiltonian descent on power functions. The continuous system becomes
ẋ_t = sgn(p_t)|p_t|^{a−1}
ṗ_t = −sgn(x_t)|x_t|^{b−1} − γp_t.
[Plot: vector fields in the (position x, momentum p) plane.]
29 Hamiltonian descent on power functions. [Plot: solutions in the (position x, momentum p) plane with a = 2 and b = 2.]
30 Hamiltonian descent on power functions. The worst case is x_t & p_t both small. To escape, we want along p_t that
d/dt k(p_t) = −⟨∇k(p_t), ∇f(∫∇k(p_t)dt) + γp_t⟩ ≤ −C k(p_t),
[Plot: k(p) and f(x) side by side.] i.e. ∇k(p) ≈ (∇f)^{−1}(p).
31 Hamiltonian descent on power functions. For power functions, this is 1/a + 1/b ≥ 1. [Plot: the (b, a) plane for f(x) = |x|^b / b, k(p) = |p|^a / a, split into linear-convergence and sub-linear-convergence regions.] We show linear convergence in continuous time iff 1/a + 1/b ≥ 1.
32 Hamiltonian descent on power functions. [Plot: solutions in the (position x, momentum p) plane with a = 2 and b = 8, where 1/a + 1/b < 1.]
33 Hamiltonian descent on power functions. We study three fixed-ε discretizations; e.g. the first explicit one is
(p_{i+1} − p_i)/ε = −∇f(x_i) − γp_{i+1}
(x_{i+1} − x_i)/ε = ∇k(p_{i+1}).
If k(p) = ϕ_a(|p|), all discretizations require L > 0 such that for all x ∈ R,
|f′(x)|^a / a ≤ L(f(x) − f(x*)).
If k(p) = ϕ_a(|p|) and f(x) = ϕ_b(|x|), this is satisfied if 1/a + 1/b = 1.
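The first explicit discretization above can be sketched directly: the damping term is implicit in p_{i+1}, but it appears linearly, so the p-update solves in closed form as p_{i+1} = (p_i − ε∇f(x_i)) / (1 + εγ). A minimal sketch on a matched pair with 1/a + 1/b = 1 (step size, damping, and starting point are ad hoc choices, not from the talk):

```python
import numpy as np

# First explicit discretization from the slide,
#   (p_{i+1} - p_i)/eps = -grad_f(x_i) - gamma * p_{i+1}
#   (x_{i+1} - x_i)/eps = grad_k(p_{i+1}),
# with f(x) = |x|^4/4 (b = 4) and k(p) = |p|^(4/3)/(4/3) (a = 4/3),
# so that 1/a + 1/b = 3/4 + 1/4 = 1.

b = 4.0
a = 4.0 / 3.0
grad_f = lambda x: np.sign(x) * np.abs(x) ** (b - 1.0)
grad_k = lambda p: np.sign(p) * np.abs(p) ** (a - 1.0)

eps, gamma = 0.05, 0.9
x, p = 3.0, 0.0
for _ in range(20_000):
    p = (p - eps * grad_f(x)) / (1.0 + eps * gamma)  # implicit damping, solved
    x = x + eps * grad_k(p)                          # explicit position update
```

Note the role of ∇k near the origin: |p|^{1/3} is much larger than |p| for small momenta, which is what keeps the position moving on the flat bottom of the quartic and avoids gradient descent's sub-linear stall.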
34 Hamiltonian descent on power functions. [Plot: the (b, a) plane for f(x) = |x|^b / b, k(p) = |p|^a / a, showing the sub-linear region, the linear convergence of the 1st and 2nd explicit methods, and the quadratic point suitable for strongly convex and smooth f.] Linear convergence of fixed-ε discretizations if 1/a + 1/b = 1.
35 Hamiltonian descent on power functions. Generalize smoothness & strong convexity to power growth! [Plot: f′(x) for b = 4/3, b = 2, b = 4.] We can deal with second derivatives that shrink or explode.
36 Summary so far. ∇²f(x) bounded by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. Power functions serve as a sandbox test case for optimization and mirror the lower bound results. Hamiltonian descent can cope with ∇²f(x) shrinking or exploding.
37 Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
38 Convex Conjugate. Def. Given a convex function h : R^d → R ∪ {∞}, define the convex conjugate h* : R^d → R ∪ {∞},
h*(p) = sup_{x ∈ R^d} ⟨x, p⟩ − h(x).
E.g. h(x) = ‖x‖^b / b ⟹ h*(p) = ‖p‖^a / a, where 1/a + 1/b = 1;
h(x) = ½⟨x, Ax⟩ ⟹ h*(p) = ½⟨p, A^{−1}p⟩.
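The power-function conjugate pair on this slide can be verified numerically by approximating the sup over a grid. A minimal one-dimensional sketch (b = 4, hence a = 4/3; the grid bounds are an ad hoc choice wide enough to contain the maximizer):

```python
import numpy as np

# Check that the convex conjugate of h(x) = |x|^b / b is h*(p) = |p|^a / a
# with 1/a + 1/b = 1, by brute-force evaluation of
#   h*(p) = sup_x <x, p> - h(x)
# over a fine grid in one dimension.

b = 4.0
a = b / (b - 1.0)              # conjugate exponent: 1/a + 1/b = 1
xs = np.linspace(-5.0, 5.0, 200_001)

def conjugate(p):
    """Grid approximation of sup_x (x*p - |x|^b / b)."""
    return np.max(xs * p - np.abs(xs) ** b / b)

errors = [abs(conjugate(p) - abs(p) ** a / a) for p in (0.5, 2.0, -1.5)]
```

At p = 2, for instance, the maximizer is x = 2^{1/3} and both sides equal (3/4)·2^{4/3}, so the grid value matches the closed form to high accuracy.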
39 Choosing k. Given f, design k for fast convergence? A good choice of k(p) is related to the convex conjugate of f_c(x) = f(x + x*) − f(x*).
Assumption A. There exists α ∈ (0, 1] such that for all p ∈ R^d,
k(p) ≥ α max{f_c*(p), f_c*(−p)}.
40 Choosing k continuous. Theorem. Given f differentiable and convex with unique minimum x*, k differentiable and strictly convex with unique minimum k(0) = 0, α satisfying Assumption A, and γ ∈ (0, 1), let λ = γ(1 − γ)/4. Then the solutions of the Hamiltonian descent system satisfy
f(x_t) − f(x*) ≤ O(exp(−λαt)).
41 Choosing k discrete. Assumption B. All discretizations require a first-order assumption: there exists C_{f,k} > 0 such that for all x, p ∈ R^d,
⟨∇f(x), ∇k(p)⟩ ≤ C_{f,k} H(x, p).
42 Choosing k discrete. Assumptions C or D. Explicit discretizations require second-order assumptions on either f or k.
Assumption C. There exists D_{f,k} > 0 such that for all x ∈ R^d \ {x*} and p ∈ R^d, f is twice continuously differentiable and
⟨∇k(p), ∇²f(x)∇k(p)⟩ ≤ D_{f,k} H(x, p).
Assumption D. Switch f and k in Assumption C.
Under such assumptions, there exists C > 0 such that the discretizations converge linearly for all ε ∈ (0, C], γ ∈ (0, 1].
43 Power Kinetic Energies. Given k, on which class of f is convergence fast? Def. Given a, A ∈ [1, ∞), define for t ∈ [0, ∞)
ϕ_a^A(t) = ((t^a + 1)^{A/a} − 1)/A.
ϕ_a^A behaves like ϕ_A for large t and like ϕ_a for small t. Conditions on f are given in terms of a norm ‖·‖ and k as
k(p) = ϕ_a^A(‖p‖_*), where ‖p‖_* = sup_{‖x‖ ≤ 1} ⟨p, x⟩.
44 Power Kinetic Energies. [Plot: ϕ_a^A(|x|) for a = 8/7, 2, 8 and A = 8/7, 2, 8, showing the crossover between the small-|x| power a and the large-|x| power A.]
45 Power Kinetic Energies. Let b = a/(a − 1) and B = A/(A − 1). Assumption A is implied by: there exists µ > 0 such that
f(x) − f(x*) ≥ µ ϕ_b^B(‖x − x*‖).
This is implied by strong convexity for b = B = 2.
46 Power Kinetic Energies. Assumption B is implied by: there exists L > 0 such that
ϕ_a^A(‖∇f(x)‖_*) ≤ L(f(x) − f(x*)).
This is implied by smoothness for a = A = 2.
47 Power Kinetic Energies. Assumption C for b, B ≥ 2 is implied by twice continuous differentiability of f together with: there exists L > 0 such that for all x ∈ R^d \ {x*},
∇²f(x) ⪯ L ∇²ϕ_b^B(‖x − x*‖).
This is equivalent to smoothness for b = B = 2. Assumption D relies on smoothness of k, so it requires twice continuous differentiability of ‖·‖_*.
48 Simulations, f(x) = ϕ_4(|x|). [Plots: objective log f(x_t), log f(x_i) and solution/iterate paths x_t, x_i against time, for k(p) = 3|p|^{4/3}/4 and for k(p) = |p|²/2.]
49 Simulations, f(x) = ϕ_4(|x|). [Plots: objective log f(x_t), log f(x_i) and solution/iterate paths x_t, x_i against time, for k(p) = 7|p|^{8/7}/8.]
50 Adaptive rates. α may improve as (x_i, p_i) → (x*, 0). To capture this, our analysis is extended to
k(p) ≥ α(k(p)) max{f_c*(p), f_c*(−p)}
for α : [0, ∞) → (0, 1] differentiable, convex, and non-increasing. This allows us to provide a position-independent step-size choice with naturally adaptive rates for B ≥ A/(A − 1).
51 Relativistic Kinetic Energy. Lu et al. [5] study the relativistic kinetic energy for sampling:
k(p) = (‖p‖₂² + 1)^{1/2} − 1, ∇k(p) = p / (‖p‖₂² + 1)^{1/2}.
‖∇k(p)‖₂ is bounded, which improves stability, similar to gradient clipping [9], Adam [3], RMSProp [2], AdaGrad [1].
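The boundedness claim on this slide is a one-liner to check: for the relativistic kinetic energy, ‖∇k(p)‖ = ‖p‖ / √(‖p‖² + 1) < 1 for every p, approaching 1 as ‖p‖ grows. A minimal sketch (the test momenta are arbitrary):

```python
import numpy as np

# Gradient of the relativistic kinetic energy from the slide,
#   k(p) = sqrt(||p||^2 + 1) - 1,
# whose norm is always strictly below 1 -- the bounded "velocity" that
# makes it behave like gradient clipping.

def grad_k_relativistic(p):
    """grad_k(p) = p / sqrt(||p||^2 + 1)."""
    p = np.asarray(p, dtype=float)
    return p / np.sqrt(np.dot(p, p) + 1.0)

norms = [np.linalg.norm(grad_k_relativistic(np.full(3, s)))
         for s in (0.1, 10.0, 1e6)]
```

For tiny momenta ∇k(p) ≈ p (quadratic behavior), while for huge momenta its norm saturates near 1, which is the stability mechanism the slide compares to clipping.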
52 Relativistic Kinetic Energy. The relativistic kinetic energy is k(p) = ϕ_2^1(‖p‖₂). Suitable for strongly convex, but possibly non-smooth, f. Has adaptive rates:
α(y) ∝ (y + 1)^{1/B − 1}.
53 Simulations, f(x) = ϕ_2^8(‖x‖₂). [Plot: log f(x_i) against iteration i, and iterate paths x_i, for gradient descent, k(p) = ϕ_2^1(‖p‖₂), and k(p) = ϕ_2^{8/7}(‖p‖₂).]
54 Conclusions. Theoretical: Lower bounds assuming two first-order oracles? Optimal γ, ε?
55 Conclusions. Methodological: Kinetic energies k for specific problems of interest? Constrained optimization? The biggest limitation is that designing k requires knowledge of f near its minimum. Adaptive methods, e.g. [11]?
56 Thanks to you and my coauthors: Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, Arnaud Doucet.
57 References
[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[2] G. Hinton. Neural Networks for Machine Learning. Slides of Lecture 6.
[3] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[4] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87–89, 1963.
[5] X. Lu, V. Perrone, L. Hasenclever, Y. W. Teh, and S. Vollmer. Relativistic Monte Carlo. In Artificial Intelligence and Statistics, 2017.
[6] R. McLachlan and M. Perlmutter. Conformal Hamiltonian systems. Journal of Geometry and Physics, 39(4), 2001.
[7] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
[8] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2004.
[9] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
[10] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[11] V. Roulet and A. d'Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems, 2017.
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationCoordinate Update Algorithm Short Course Proximal Operators and Algorithms
Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow
More informationDesign and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016
Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More information1 Sparsity and l 1 relaxation
6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the
More informationStochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017
Stochastic Gradient Descent: The Workhorse of Machine Learning CS6787 Lecture 1 Fall 2017 Fundamentals of Machine Learning? Machine Learning in Practice this course What s missing in the basic stuff? Efficiency!
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationminimize x subject to (x 2)(x 4) u,
Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationNormalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training
Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network raining Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell School of Computer Science, Carnegie Mellon University
More informationProbabilistic Graphical Models
10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem
More informationCSC2541 Lecture 5 Natural Gradient
CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient 1 / 12 Motivation Two classes of optimization procedures used throughout ML (stochastic) gradient descent,
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationDeep Learning & Artificial Intelligence WS 2018/2019
Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target
More informationAccelerate Subgradient Methods
Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationOptimizing CNNs. Timothy Dozat Stanford. Abstract. 1. Introduction. 2. Background Momentum
Optimizing CNNs Timothy Dozat Stanford tdozat@stanford.edu Abstract This work aims to explore the performance of a popular class of related optimization algorithms in the context of convolutional neural
More informationNegative Momentum for Improved Game Dynamics
Negative Momentum for Improved Game Dynamics Gauthier Gidel Reyhane Askari Hemmat Mohammad Pezeshki Gabriel Huang Rémi Lepriol Simon Lacoste-Julien Ioannis Mitliagkas Mila & DIRO, Université de Montréal
More informationErgodic Subgradient Descent
Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More information