How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India
Transcription
1 How to Escape Saddle Points Efficiently? Praneeth Netrapalli (Microsoft Research India), Chi Jin (UC Berkeley), Michael I. Jordan (UC Berkeley), Rong Ge (Duke Univ.), Sham M. Kakade (U Washington)
2 Nonconvex optimization. Problem: $\min_x f(x)$, where $f$ is a nonconvex function. Applications: deep learning, compressed sensing, matrix completion, dictionary learning, nonnegative matrix factorization, ...
3 Gradient descent (GD) [Cauchy 1847]: $x_{t+1} = x_t - \eta \nabla f(x_t)$. Question: How does it perform?
4 Gradient descent (GD) [Cauchy 1847]: $x_{t+1} = x_t - \eta \nabla f(x_t)$. Question: How does it perform? Answer: Converges to first order stationary points.
5 Gradient descent (GD) [Cauchy 1847]: $x_{t+1} = x_t - \eta \nabla f(x_t)$. Question: How does it perform? Answer: Converges to first order stationary points. Definition: $\epsilon$-first order stationary point ($\epsilon$-FOSP): $\|\nabla f(x)\| \le \epsilon$.
6 Gradient descent (GD) [Cauchy 1847]: $x_{t+1} = x_t - \eta \nabla f(x_t)$. Question: How does it perform? Answer: Converges to first order stationary points. Definition: $\epsilon$-first order stationary point ($\epsilon$-FOSP): $\|\nabla f(x)\| \le \epsilon$. Concretely: reaches an $\epsilon$-FOSP in $O(1/\epsilon^2)$ iterations [Folklore].
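To make the update and stopping rule concrete, here is a minimal Python sketch of gradient descent run until an $\epsilon$-FOSP is reached; the quadratic test objective, step size, and tolerance are illustrative choices, not values from the talk.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, eps=1e-4, max_iters=10_000):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t) until ||grad f(x)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:  # eps-first-order stationary point reached
            return x, t
        x = x - eta * g
    return x, max_iters

# Example: f(x) = 0.5 * ||x||^2, whose gradient is x.
x_fosp, iters = gradient_descent(lambda x: x, x0=[1.0, -2.0])
print(x_fosp, iters)
```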
7 What do FOSPs look like?
8 What do FOSPs look like? Hessian PSD, $\nabla^2 f(x) \succeq 0$: second order stationary points (SOSP).
9 What do FOSPs look like? Hessian PSD, $\nabla^2 f(x) \succeq 0$: second order stationary points (SOSP). Hessian NSD, $\nabla^2 f(x) \preceq 0$: local maxima.
10 What do FOSPs look like? Hessian PSD, $\nabla^2 f(x) \succeq 0$: second order stationary points (SOSP). Hessian NSD, $\nabla^2 f(x) \preceq 0$: local maxima. Hessian indefinite, $\lambda_{\min}(\nabla^2 f(x)) < 0 < \lambda_{\max}(\nabla^2 f(x))$: saddle points.
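A quick numeric illustration of this classification; the toy Hessians below are chosen for the example and are not from the talk.

```python
import numpy as np

def classify_fosp(hess):
    """Classify a first order stationary point by the eigenvalues of its Hessian."""
    lam_min, lam_max = np.linalg.eigvalsh(hess)[[0, -1]]
    if lam_min >= 0:
        return "SOSP (local minimum candidate)"
    if lam_max <= 0:
        return "local maximum"
    return "saddle point"

# Hessians at the origin of f(x) = x1^2 + x2^2, -x1^2 - x2^2, and x1^2 - x2^2.
for H in (np.diag([2.0, 2.0]), np.diag([-2.0, -2.0]), np.diag([2.0, -2.0])):
    print(classify_fosp(H))
```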
11 FOSPs in popular problems (very well studied): neural networks [Dauphin et al. 2014], matrix sensing [Bhojanapalli et al. 2016], matrix completion [Ge et al. 2016], robust PCA [Ge et al. 2017], tensor factorization [Ge et al. 2015, Ge & Ma 2017], smooth semidefinite programs [Boumal et al. 2016], synchronization & community detection [Bandeira et al. 2016, Mei et al. 2017].
12 Two major observations. FOSPs: proliferation (exponentially many) of saddle points; recall a FOSP satisfies $\nabla f(x) = 0$, and gradient descent can get stuck near them. SOSPs: not just local minima, but as good as global minima in the problems above; recall an SOSP satisfies $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$. Upshot: 1. FOSP is not good enough. 2. Finding an SOSP is sufficient.
13 Can gradient descent find SOSPs? Yes: perturbed GD finds an $\epsilon$-SOSP in $O(\mathrm{poly}(d/\epsilon))$ iterations [Ge et al. 2015]. This is notable since GD is a first order method while the SOSP condition captures second order information.
14 Can gradient descent find SOSPs? Yes: perturbed GD finds an $\epsilon$-SOSP in $O(\mathrm{poly}(d/\epsilon))$ iterations [Ge et al. 2015]. This is notable since GD is a first order method while the SOSP condition captures second order information. Question 1: Does perturbed GD converge to SOSPs efficiently, in particular with iteration complexity independent of $d$?
15 Can gradient descent find SOSPs? Yes: perturbed GD finds an $\epsilon$-SOSP in $O(\mathrm{poly}(d/\epsilon))$ iterations [Ge et al. 2015]. This is notable since GD is a first order method while the SOSP condition captures second order information. Question 1: Does perturbed GD converge to SOSPs efficiently, in particular with iteration complexity independent of $d$? Our result: Almost yes, in $O(\mathrm{polylog}(d)/\epsilon^2)$ iterations!
16 Accelerated gradient descent (AGD) [Nesterov 1983]. Optimal algorithm in the convex setting. Practice: Sutskever et al. observed AGD to be much faster than GD, and it has been widely used in training neural networks since then. Theory: finds an $\epsilon$-FOSP in $O(1/\epsilon^2)$ iterations [Ghadimi & Lan 2013], i.e., no improvement over GD.
17 Question 2: Does essentially pure AGD find SOSPs faster than GD? Our result: Yes, in $O(\mathrm{polylog}(d)/\epsilon^{1.75})$ steps, compared to $O(\mathrm{polylog}(d)/\epsilon^2)$ for GD. The algorithm adds perturbation + negative curvature exploitation (NCE) on top of AGD; NCE is inspired by Carmon et al. Carmon et al. and Agarwal et al. show this improved rate for a more complicated algorithm that solves a sequence of regularized problems using AGD.
18 Summary. $\epsilon$-SOSP [Nesterov & Polyak 2006]: $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho\epsilon}$ ($\rho$: Hessian Lipschitz constant). Convergence to SOSPs is very important in practice; pure GD and AGD can get stuck near FOSPs (saddle points).

Algorithm | Paper | # Iterations | Simplicity
Perturbed gradient descent | Ge et al. 2015, Levy 2016 | $O(\mathrm{poly}(d/\epsilon))$ | Single loop
Perturbed gradient descent | Jin, Ge, N., Kakade, Jordan 2017 | $O(\mathrm{polylog}(d)/\epsilon^2)$ | Single loop
Sequence of regularized subproblems with AGD | Carmon et al., Agarwal et al. | $O(\mathrm{polylog}(d)/\epsilon^{1.75})$ | Nested loop
Perturbed AGD + NCE | Jin, N., Jordan 2017 | $O(\mathrm{polylog}(d)/\epsilon^{1.75})$ | Single loop
19 Part I Main Ideas of the Proof of Gradient Descent
20 Setting. Gradient Lipschitz: $\|\nabla f(x) - \nabla f(y)\| \le \ell \|x - y\|$. Hessian Lipschitz: $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho \|x - y\|$. Lower bounded: $\min_x f(x) > -\infty$.
21 How does GD behave? GD step: $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$. Recall FOSP: $\|\nabla f(x)\|$ small; SOSP: $\|\nabla f(x)\|$ small and $\lambda_{\min}(\nabla^2 f(x)) \gtrsim 0$.
22-23 How does GD behave? GD step: $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$. If $\|\nabla f(x_t)\|$ is large, the step makes significant progress: $f(x_{t+1}) \le f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2$. If $\|\nabla f(x_t)\|$ is small, $x_t$ is near a FOSP, which may be either an SOSP or a saddle point, and GD can stall there. [Figure: one GD step from $x_t$ to $x_{t+1} = x_t - \eta \nabla f(x_t)$; small-gradient regions contain both SOSPs and saddle points.]
24 How to escape saddle points?
25 Perturbed gradient descent
1. For $t = 0, 1, \dots$ do
2.   if perturbation_condition_holds then
3.     $x_t \leftarrow x_t + \xi_t$, where $\xi_t \sim \mathrm{Unif}(B_0(\epsilon))$
4.   $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$
26 Perturbed gradient descent
1. For $t = 0, 1, \dots$ do
2.   if perturbation_condition_holds then
3.     $x_t \leftarrow x_t + \xi_t$, where $\xi_t \sim \mathrm{Unif}(B_0(\epsilon))$
4.   $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$
Between two perturbations, just run GD!
27 Perturbed gradient descent
1. For $t = 0, 1, \dots$ do
2.   if perturbation_condition_holds then
3.     $x_t \leftarrow x_t + \xi_t$, where $\xi_t \sim \mathrm{Unif}(B_0(\epsilon))$
4.   $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$
perturbation_condition_holds: 1. $\|\nabla f(x_t)\|$ is small, and 2. no perturbation was added in the last several iterations. Between two perturbations, just run GD!
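Below is a minimal Python sketch of this loop. The perturbation radius `r`, gradient threshold `g_thres`, and waiting time `t_wait` are placeholder hyperparameters standing in for the constants of the actual analysis.

```python
import numpy as np

def perturbed_gd(grad_f, x0, eta=0.1, g_thres=1e-3, r=1e-3, t_wait=50, T=10_000):
    """Run GD, adding a uniform-ball perturbation whenever the gradient is small
    and no perturbation has been added in the last t_wait iterations."""
    x = np.asarray(x0, dtype=float)
    d = x.size
    last_perturb = -t_wait  # allow a perturbation immediately
    for t in range(T):
        if np.linalg.norm(grad_f(x)) <= g_thres and t - last_perturb >= t_wait:
            xi = np.random.randn(d)
            xi *= r * np.random.rand() ** (1.0 / d) / np.linalg.norm(xi)  # uniform in ball of radius r
            x = x + xi
            last_perturb = t
        x = x - eta * grad_f(x)  # between perturbations, just run GD
    return x
```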
28 How can perturbation help?
29 Key question. Let $S$ be the set of points around a saddle point from which gradient descent does not escape quickly (escape = the function value decreases significantly). How large is $\mathrm{Vol}(S)$? If $\mathrm{Vol}(S)$ is small, then perturbed GD escapes saddle points efficiently.
30 Two dimensional quadratic case. $f(x) = \frac{1}{2} x^\top \mathrm{diag}(1, -1)\, x$, so $\lambda_{\min}(H) = -1 < 0$ and $(0,0)$ is a saddle point. GD: $x_{t+1} = \mathrm{diag}(1-\eta,\ 1+\eta)\, x_t$. Inside a small ball around $(0,0)$, the stuck region $S$ is a thin strip (points with almost no component along the escape direction), so $\mathrm{Vol}(S)$ is small.
31 Three dimensional quadratic case. $f(x) = \frac{1}{2} x^\top \mathrm{diag}(1, 1, -1)\, x$, so $(0,0,0)$ is a saddle point. GD: $x_{t+1} = \mathrm{diag}(1-\eta,\ 1-\eta,\ 1+\eta)\, x_t$. Inside a small ball around $(0,0,0)$, the stuck region $S$ is a thin disc, so $\mathrm{Vol}(S)$ is small.
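These toy cases are easy to check numerically. The sketch below (arbitrary step size and perturbation size) runs GD on $f(x) = \frac{1}{2}(x_1^2 - x_2^2)$: started exactly on the thin strip $x_2 = 0$ it converges to the saddle $(0,0)$, while an initial $x_2$ component of size $10^{-8}$ grows by a factor $(1+\eta)$ per step and carries the iterate away from the saddle (the quadratic is unbounded below along $x_2$, so "escape" here just means leaving the neighborhood of the saddle).

```python
import numpy as np

eta = 0.1
grad = lambda x: np.array([x[0], -x[1]])  # gradient of f(x) = 0.5 * (x1^2 - x2^2)

def run_gd(x, steps=200):
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

print(run_gd(np.array([1.0, 0.0])))    # stays on the strip: converges to the saddle (0, 0)
print(run_gd(np.array([1.0, 1e-8])))   # tiny x2 component grows like (1 + eta)^t and escapes
```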
32 General case. Key technical result: $S$ is a thin, deformed disc, so $\mathrm{Vol}(S)$ is small.
33 Two key ingredients of the proof. Improve or localize: $f(x_{t+1}) \le f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2 = f(x_t) - \frac{1}{2\eta}\|x_t - x_{t+1}\|^2$, i.e., $\|x_t - x_{t+1}\|^2 \le 2\eta\,(f(x_t) - f(x_{t+1}))$. Summing over the trajectory: $\|x_0 - x_t\|^2 \le t \sum_{i=0}^{t-1}\|x_i - x_{i+1}\|^2 \le 2\eta t\,(f(x_0) - f(x_t))$.
34 Two key ingredients of the proof. Improve or localize. Upshot: either the function value decreases significantly, or the iterates do not move much: $\|x_0 - x_t\|^2 \le t \sum_{i=0}^{t-1}\|x_i - x_{i+1}\|^2 \le 2\eta t\,(f(x_0) - f(x_t))$.
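For completeness, here is a compact derivation of the inequality on this slide, combining the descent lemma (with step size $\eta = 1/\ell$) with the triangle and Cauchy-Schwarz inequalities:

```latex
\begin{align*}
f(x_{i+1}) &\le f(x_i) - \tfrac{\eta}{2}\|\nabla f(x_i)\|^2
            = f(x_i) - \tfrac{1}{2\eta}\|x_{i+1} - x_i\|^2
  && \text{(descent lemma; } x_{i+1} - x_i = -\eta \nabla f(x_i)\text{)} \\
\|x_0 - x_t\|^2 &\le \Big(\sum_{i=0}^{t-1} \|x_{i+1} - x_i\|\Big)^2
  \le t \sum_{i=0}^{t-1} \|x_{i+1} - x_i\|^2
  && \text{(triangle inequality, then Cauchy-Schwarz)} \\
 &\le 2\eta t\,\big(f(x_0) - f(x_t)\big).
  && \text{(telescoping the first line)}
\end{align*}
```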
35 Proof idea. Consider two coupled GD sequences started from nearby points $u$ and $w$ (differing only along the escape direction). If GD from either $u$ or $w$ goes outside a small ball, it escapes (the function value decreases). Coupling: if GD from both $u$ and $w$ stays inside the small ball, use the local quadratic approximation of $f(\cdot)$; prove the claim for an exact quadratic, and bound the approximation error using the Hessian Lipschitz property. Conclusion: either GD from $u$ escapes, or GD from $w$ escapes.
36 Putting everything together. GD step: $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$. If $\|\nabla f(x_t)\|$ is large: $f(\cdot)$ decreases. If $\|\nabla f(x_t)\|$ is small: near a saddle point, perturbation + GD moves away; near an SOSP, perturbation + GD stays at the SOSP.
37 Part II Main Ideas of the Proof of Accelerated Gradient Descent
38 Nesterov's AGD. Iterate $x_t$ and velocity $v_t$:
1. $x_{t+1} = x_t + (1-\theta)v_t - \eta \nabla f\big(x_t + (1-\theta)v_t\big)$
2. $v_{t+1} = x_{t+1} - x_t$
That is, a gradient descent step taken at $x_t + (1-\theta)v_t$. Challenge: known potential functions depend on the optimum $x^\star$.
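A minimal Python sketch of this update; the momentum parameter $\theta$ and step size $\eta$ below are placeholder values, not the tuned constants from the analysis.

```python
import numpy as np

def agd(grad_f, x0, eta=0.01, theta=0.1, T=1000):
    """Nesterov-style AGD: gradient step taken at the look-ahead point x_t + (1 - theta) * v_t."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(T):
        y = x + (1.0 - theta) * v       # look-ahead point
        x_next = y - eta * grad_f(y)    # gradient descent step at the look-ahead point
        v = x_next - x                  # new velocity
        x = x_next
    return x
```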
39 Differential equation view of AGD. AGD is a discretization of the following ODE [Su et al. 2015]: $\ddot{x} + \theta \dot{x} + \nabla f(x) = 0$. Multiplying by $\dot{x}$ and integrating from $t_1$ to $t_2$ gives
$f(x_{t_2}) + \frac{1}{2}\|\dot{x}_{t_2}\|^2 = f(x_{t_1}) + \frac{1}{2}\|\dot{x}_{t_1}\|^2 - \theta \int_{t_1}^{t_2} \|\dot{x}_t\|^2\, dt.$
The Hamiltonian $f(x_t) + \frac{1}{2}\|\dot{x}_t\|^2$ decreases monotonically.
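The monotone decrease of the Hamiltonian can also be seen by differentiating rather than integrating, a one-line check along the same ODE:

```latex
\frac{d}{dt}\Big[ f(x_t) + \tfrac{1}{2}\|\dot{x}_t\|^2 \Big]
  = \langle \nabla f(x_t), \dot{x}_t \rangle + \langle \dot{x}_t, \ddot{x}_t \rangle
  = \langle \dot{x}_t,\ \nabla f(x_t) + \ddot{x}_t \rangle
  = -\theta\,\|\dot{x}_t\|^2 \le 0,
```

since the ODE gives $\ddot{x}_t = -\theta \dot{x}_t - \nabla f(x_t)$.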
40 After discretization. Iterate $x_t$ and velocity $v_t = x_t - x_{t-1}$. The Hamiltonian $f(x_t) + \frac{1}{2\eta}\|v_t\|^2$ decreases monotonically if $f$ is not too nonconvex between $x_t$ and $x_t + v_t$ (too nonconvex = negative curvature); it can increase if $f$ is too nonconvex. If the function is too nonconvex, reset the velocity or move in the nonconvex direction: negative curvature exploitation.
41 Hamiltonian decrease. Look at $f$ between $x_t$ and $x_t + v_t$. Not too nonconvex: take an AGD step, and $f(x_t) + \frac{1}{2\eta}\|v_t\|^2$ decreases. Too nonconvex: if $\|v_t\|$ is large, set $v_{t+1} = 0$; if $\|v_t\|$ is small, move in the $\pm v_t$ direction.
42 Negative curvature exploitation. When $\|v_t\|$ is small, one of the two moves $x_t + v_t$ or $x_t - v_t$ decreases $f$ below $f(x_t)$.
43 Hamiltonian decrease. Look at $f$ between $x_t$ and $x_t + v_t$. Not too nonconvex: take an AGD step, and $f(x_t) + \frac{1}{2\eta}\|v_t\|^2$ decreases (this branch needs an amortized analysis). Too nonconvex (negative curvature exploitation): if $\|v_t\|$ is large, set $v_{t+1} = 0$; if $\|v_t\|$ is small, move in the $\pm v_t$ direction; either way there is enough decrease in a single step.
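Putting the pieces together, here is a schematic Python version of perturbed AGD with negative curvature exploitation. The "too nonconvex" test and all thresholds (`gamma`, `s`, `g_thres`, `r`) are simplified placeholders, and the waiting-time check on perturbations is omitted; this is a sketch of the idea, not the parameter settings of Jin, N., Jordan 2017.

```python
import numpy as np

def perturbed_agd_nce(f, grad_f, x0, eta=0.01, theta=0.1, gamma=1.0, s=0.1,
                      g_thres=1e-3, r=1e-3, T=10_000):
    """Sketch: AGD steps, occasional perturbation near FOSPs, and NCE when f is
    'too nonconvex' between x_t and x_t + v_t."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for t in range(T):
        if np.linalg.norm(grad_f(x)) <= g_thres:
            xi = np.random.randn(x.size)
            x = x + r * xi / np.linalg.norm(xi)        # perturbation near a FOSP
        y = x + (1.0 - theta) * v
        x_next = y - eta * grad_f(y)                   # AGD step
        v_next = x_next - x
        # "Too nonconvex" test between x and x + v: f falls below its linear model
        # by more than (gamma/2)||v||^2, certifying curvature < -gamma along v.
        if np.linalg.norm(v) > 0 and \
           f(x + v) < f(x) + grad_f(x) @ v - 0.5 * gamma * (v @ v):
            if np.linalg.norm(v) >= s:
                x_next, v_next = x, np.zeros_like(v)   # large velocity: reset it
            else:
                d = s * v / np.linalg.norm(v)          # small velocity: step along +/- v,
                x_next = min(x + d, x - d, key=f)      # whichever decreases f more
                v_next = np.zeros_like(v)
        x, v = x_next, v_next
    return x
```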
44 Improve or localize. $f(x_{t+1}) + \frac{1}{2\eta}\|v_{t+1}\|^2 \le f(x_t) + \frac{1}{2\eta}\|v_t\|^2 - \frac{\theta}{2\eta}\|v_t\|^2$, which gives $\sum_{t=0}^{T-1} \|x_{t+1} - x_t\|^2 \le \frac{2\eta}{\theta}\,\big(f(x_0) - f(x_T)\big)$. Approximate $f$ locally by a quadratic and perform the computations; the precise computations are technically challenging.
45 Summary. Simple variations of GD/AGD ensure efficient escape from saddle points. Fine-grained understanding of the geometric structure around saddle points. Novel techniques of independent interest. Some extensions to the stochastic setting.
46 Open questions. Is NCE really necessary? Lower bounds: recent work by Carmon et al. 2017, but gaps remain between upper and lower bounds. Extensions to the stochastic setting. Nonconvex optimization for faster algorithms.
47 Thank you! Questions?