How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India

Praneeth Netrapalli (Microsoft Research India), Chi Jin (UC Berkeley), Michael I. Jordan (UC Berkeley), Rong Ge (Duke University), Sham M. Kakade (University of Washington)

Nonconvex optimization. Problem: min_x f(x), where f is a nonconvex function. Applications: deep learning, compressed sensing, matrix completion, dictionary learning, nonnegative matrix factorization, and more.

Gradient descent (GD) [Cauchy 1847]: x_{t+1} = x_t − η∇f(x_t)

Question: How does it perform?
Answer: It converges to first-order stationary points.
Definition: x is an ε-first-order stationary point (ε-FOSP) if ‖∇f(x)‖ ≤ ε.
Concretely, GD finds an ε-FOSP in O(1/ε²) iterations [folklore].
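A minimal sketch of the plain GD loop above; the quadratic test objective, step size, and stopping tolerance are illustrative choices, not values from the talk:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, eps=1e-4, max_iters=10_000):
    """Run x_{t+1} = x_t - eta * grad_f(x_t) until an eps-FOSP is reached."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:      # eps-first-order stationary point
            return x, t
        x = x - eta * g
    return x, max_iters

# Example: f(x) = 0.5 * (x1^2 + 2 * x2^2); its only stationary point is the origin.
grad = lambda x: np.array([1.0, 2.0]) * x
x_star, iters = gradient_descent(grad, x0=[1.0, -1.0])
```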

What do FOSPs look like? Classify a FOSP x by its Hessian:
- Hessian PSD, ∇²f(x) ⪰ 0: a second-order stationary point (SOSP), e.g., a local minimum.
- Hessian NSD, ∇²f(x) ⪯ 0: a local maximum.
- Hessian indefinite, λ_min(∇²f(x)) < 0 < λ_max(∇²f(x)): a saddle point.
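A small numerical illustration of this classification; the test function and tolerance are my own hypothetical choices:

```python
import numpy as np

def classify_fosp(hessian, tol=1e-8):
    """Classify a first-order stationary point by the eigenvalues of its Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)          # Hessian is symmetric
    lam_min, lam_max = eigvals[0], eigvals[-1]
    if lam_min >= -tol:
        return "SOSP (local minimum if strictly positive definite)"
    if lam_max <= tol:
        return "local maximum"
    return "saddle point"

# f(x, y) = x^2 - y^2 has a stationary point at the origin with Hessian diag(2, -2).
print(classify_fosp(np.diag([2.0, -2.0])))         # -> saddle point
```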

FOSPs in popular problems, very well studied:
- Neural networks [Dauphin et al. 2014]
- Matrix sensing [Bhojanapalli et al. 2016]
- Matrix completion [Ge et al. 2016]
- Robust PCA [Ge et al. 2017]
- Tensor factorization [Ge et al. 2015, Ge & Ma 2017]
- Smooth semidefinite programs [Boumal et al. 2016]
- Synchronization & community detection [Bandeira et al. 2016, Mei et al. 2017]

Two major observations:
- FOSPs (recall: ∇f(x) = 0): there is a proliferation (an exponential number) of saddle points, and gradient descent can get stuck near them.
- SOSPs (recall: ∇f(x) = 0 and ∇²f(x) ⪰ 0): in the problems above they are not just local minima, they are as good as global minima.
Upshot: 1. FOSPs are not good enough. 2. Finding an SOSP is sufficient.

Can gradient descent find SOSPs? Yes: perturbed GD finds an ε-SOSP in O(poly(d/ε)) iterations [Ge et al. 2015], even though GD is a first-order method while the SOSP condition captures second-order information.
Question 1: Does perturbed GD converge to an SOSP efficiently, in particular with an iteration count (nearly) independent of d?
Our result: Almost yes, in O(polylog(d)/ε²) iterations!

Accelerated gradient descent (AGD) [Nesterov 1983]: the optimal algorithm in the convex setting.
Practice: Sutskever et al. 2013 observed AGD to be much faster than GD; it has been widely used for training neural networks since then.
Theory: AGD finds an ε-FOSP in O(1/ε²) iterations [Ghadimi & Lan 2013], i.e., no improvement over GD.

Question 2: Does essentially pure AGD find SOSPs faster than GD?
Our result: Yes, in O(polylog(d)/ε^1.75) steps, compared to O(polylog(d)/ε²) for GD. The algorithm is perturbation + negative curvature exploitation (NCE) on top of AGD; NCE is inspired by Carmon et al. 2017. Carmon et al. 2016 and Agarwal et al. 2017 show this improved rate for a more complicated algorithm that solves a sequence of regularized problems using AGD.

Summary. ε-SOSP [Nesterov & Polyak 2006]: ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(ρε), where ρ is the Hessian Lipschitz constant. Convergence to SOSPs is very important in practice, yet pure GD and AGD can get stuck near FOSPs (saddle points).

Algorithm | Paper | # Iterations | Simplicity
Perturbed gradient descent | Ge et al. 2015, Levy 2016 | O(poly(d/ε)) | Single loop
Perturbed gradient descent | Jin, Ge, N., Kakade, Jordan 2017 | O(polylog(d)/ε²) | Single loop
Sequence of regularized subproblems with AGD | Carmon et al. 2016, Agarwal et al. 2017 | O(polylog(d)/ε^1.75) | Nested loop
Perturbed AGD + NCE | Jin, N., Jordan 2017 | O(polylog(d)/ε^1.75) | Single loop

Part I: Main Ideas of the Proof for Gradient Descent

Setting:
- Gradient Lipschitz: ‖∇f(x) − ∇f(y)‖ ≤ ℓ‖x − y‖
- Hessian Lipschitz: ‖∇²f(x) − ∇²f(y)‖ ≤ ρ‖x − y‖
- Lower bounded: min_x f(x) > −∞

How does GD behave? GD step: x_{t+1} ← x_t − η∇f(x_t). Recall: FOSP means ‖∇f(x)‖ is small; SOSP means ‖∇f(x)‖ is small and λ_min(∇²f(x)) ≥ 0.
- If ‖∇f(x_t)‖ is large: the step makes progress, f(x_{t+1}) ≤ f(x_t) − (η/2)‖∇f(x_t)‖².
- If ‖∇f(x_t)‖ is small: x_t may be near an SOSP or near a saddle point.
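For completeness, the descent inequality above follows in one line from the gradient Lipschitz assumption (constant ℓ) with the standard step size η = 1/ℓ:

$$
f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t),\, x_{t+1} - x_t \rangle + \tfrac{\ell}{2}\|x_{t+1} - x_t\|^2
= f(x_t) - \eta\|\nabla f(x_t)\|^2 + \tfrac{\ell\eta^2}{2}\|\nabla f(x_t)\|^2
= f(x_t) - \tfrac{\eta}{2}\|\nabla f(x_t)\|^2 .
$$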

How to escape saddle points?

Perturbed gradient descent:
1. For t = 0, 1, ... do
2.   if perturbation_condition_holds then
3.     x_t ← x_t + ξ_t, where ξ_t is sampled uniformly from a small ball centered at 0
4.   x_{t+1} ← x_t − η∇f(x_t)

The perturbation condition: (1) ‖∇f(x_t)‖ is small, and (2) no perturbation was added in the last several iterations. Between two perturbations, just run GD!
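A minimal sketch of this scheme; the gradient threshold, perturbation radius, and waiting time below are placeholder constants, not the tuned values from the paper:

```python
import numpy as np

def perturbed_gd(grad_f, x0, eta=0.1, g_thresh=1e-3, radius=1e-3,
                 t_wait=50, max_iters=10_000, seed=0):
    """Perturbed GD: add a small uniform-ball perturbation when the gradient is small
    and no perturbation was added recently; otherwise take plain GD steps."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturb = -t_wait                         # allow a perturbation right away
    for t in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= t_wait:
            xi = rng.normal(size=x.shape)          # random direction ...
            xi *= radius * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)
            x = x + xi                             # ... uniform over a ball of radius `radius`
            last_perturb = t
            g = grad_f(x)
        x = x - eta * g                            # plain GD step between perturbations
    return x
```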

How can perturbation help?

Key question. Let S be the set of points around a saddle point from which gradient descent does not escape quickly (escape = the function value decreases significantly). How large is Vol(S)? If Vol(S) is small, then perturbed GD escapes saddle points efficiently.

Two-dimensional quadratic case: f(x) = ½ xᵀ diag(1, −1) x. Here λ_min(H) = −1 < 0, so (0, 0) is a saddle point. GD gives x_{t+1} = diag(1 − η, 1 + η) x_t. [Figure: a ball around (0, 0) with the stuck region S shaded.] S is a thin strip, so Vol(S) is small.

Three-dimensional quadratic case: f(x) = ½ xᵀ diag(1, 1, −1) x. Here (0, 0, 0) is a saddle point, and GD gives x_{t+1} = diag(1 − η, 1 − η, 1 + η) x_t. [Figure: a ball around (0, 0, 0) with the stuck region S shaded.] S is a thin disc, so Vol(S) is small.
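A quick numerical check of the two-dimensional picture; the step size, starting points, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

# f(x) = 0.5 * (x1^2 - x2^2): one GD step is the linear map diag(1 - eta, 1 + eta).
eta = 0.1
step = np.array([1.0 - eta, 1.0 + eta])

def run_gd(x, iters=200):
    for _ in range(iters):
        x = step * x
    return x

on_strip  = run_gd(np.array([0.5, 0.0]))    # exactly on the stable direction
perturbed = run_gd(np.array([0.5, 1e-6]))   # tiny component along the negative-curvature direction
print(on_strip)    # stays stuck: it only contracts toward the saddle at the origin
print(perturbed)   # second coordinate grows geometrically: GD escapes the saddle region
```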

General case. Key technical result: S is a thin deformed disc, so Vol(S) is small.

Two key ingredients of the proof. Ingredient 1, improve or localize:

f(x_{t+1}) ≤ f(x_t) − (η/2)‖∇f(x_t)‖² = f(x_t) − (1/(2η))‖x_t − x_{t+1}‖²,
so ‖x_t − x_{t+1}‖² ≤ 2η (f(x_t) − f(x_{t+1})),
and hence ‖x_0 − x_t‖² ≤ t Σ_{i=0}^{t−1} ‖x_i − x_{i+1}‖² ≤ 2ηt (f(x_0) − f(x_t)).

Upshot: either the function value decreases significantly, or the iterates do not move much.
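The last display is just Cauchy–Schwarz plus telescoping; spelled out:

$$
\|x_0 - x_t\|^2 = \Big\|\sum_{i=0}^{t-1} (x_i - x_{i+1})\Big\|^2
\le t \sum_{i=0}^{t-1} \|x_i - x_{i+1}\|^2
\le t \sum_{i=0}^{t-1} 2\eta \big(f(x_i) - f(x_{i+1})\big)
= 2\eta t \big(f(x_0) - f(x_t)\big).
$$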

Proof idea (ingredient 2, coupling). Consider two nearby starting points u and w whose difference lies along the escape (most negative curvature) direction. If GD from either u or w goes outside a small ball, it escapes (the function value decreases significantly). If GD from both u and w stays inside the small ball, use a local quadratic approximation of f: show the claim for the exact quadratic, and bound the approximation error using the Hessian Lipschitz property. Conclusion: either GD from u escapes, or GD from w escapes.

Putting everything together. GD step: x_{t+1} ← x_t − η∇f(x_t).
- ‖∇f(x_t)‖ large: f decreases.
- ‖∇f(x_t)‖ small, near a saddle point: perturbation + GD moves away from it.
- ‖∇f(x_t)‖ small, near an SOSP: perturbation + GD stays near the SOSP.

Part II: Main Ideas of the Proof for Accelerated Gradient Descent

Nesterov's AGD. Iterate x_t and velocity v_t:
1. x_{t+1} = x_t + (1 − θ) v_t − η ∇f(x_t + (1 − θ) v_t)
2. v_{t+1} = x_{t+1} − x_t
That is, a gradient descent step taken at the look-ahead point x_t + (1 − θ) v_t.
Challenge: known potential functions depend on the optimum x⋆.
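A minimal sketch of this AGD update; θ, η, the iteration budget, and the quadratic test objective are illustrative placeholders:

```python
import numpy as np

def agd(grad_f, x0, eta=0.05, theta=0.1, iters=500):
    """Nesterov-style AGD: gradient step at the look-ahead point x_t + (1 - theta) * v_t."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(iters):
        y = x + (1.0 - theta) * v          # look-ahead point
        x_next = y - eta * grad_f(y)       # gradient descent step at the look-ahead point
        v = x_next - x                     # new velocity
        x = x_next
    return x

# Example on the convex quadratic f(x) = 0.5 * (x1^2 + 10 * x2^2).
grad = lambda x: np.array([1.0, 10.0]) * x
print(agd(grad, x0=[1.0, 1.0]))            # converges toward the origin
```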

Differential equation view of AGD. AGD is a discretization of the following ODE [Su et al. 2015]:

ẍ + θẋ + ∇f(x) = 0.

Multiplying by ẋ and integrating from t₁ to t₂ gives

f(x(t₂)) + ½‖ẋ(t₂)‖² = f(x(t₁)) + ½‖ẋ(t₁)‖² − θ ∫_{t₁}^{t₂} ‖ẋ(t)‖² dt.

So the Hamiltonian f(x(t)) + ½‖ẋ(t)‖² decreases monotonically.
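The differential form of the same calculation, written out:

$$
\frac{d}{dt}\Big[f(x(t)) + \tfrac12\|\dot x(t)\|^2\Big]
= \langle \nabla f(x), \dot x\rangle + \langle \dot x, \ddot x\rangle
= \langle \dot x,\; \ddot x + \nabla f(x)\rangle
= -\theta\|\dot x\|^2 \le 0 .
$$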

After discretization. Iterate x_t and velocity v_t = x_t − x_{t−1}. The Hamiltonian f(x_t) + (1/(2η))‖v_t‖² decreases monotonically if f is not too nonconvex between x_t and x_t + v_t ("too nonconvex" = significant negative curvature); it can increase if f is too nonconvex. If the function is too nonconvex, reset the velocity or move in the negative-curvature direction: negative curvature exploitation (NCE).

Hamiltonian decrease, case by case, depending on how nonconvex f is between x_t and x_t + v_t:
- Not too nonconvex: take the AGD step; the Hamiltonian f(x_t) + (1/(2η))‖v_t‖² decreases (this requires an amortized analysis).
- Too nonconvex, ‖v_t‖ large (negative curvature exploitation): set v_{t+1} = 0; enough decrease in a single step.
- Too nonconvex, ‖v_t‖ small (negative curvature exploitation): move in the ±v_t direction; enough decrease in a single step.

Negative curvature exploitation, the ‖v_t‖-small case: f has negative curvature along v_t, so moving to one of x_t ± v_t decreases f(x_t). [Figure: f along the segment from x_t − v_t to x_t + v_t.]
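A schematic sketch of one iteration of this scheme; the "too nonconvex" test, the thresholds gamma and s, and the overall structure are simplified placeholders rather than the exact algorithm from the paper:

```python
import numpy as np

def agd_nce_step(f, grad_f, x, v, eta=0.05, theta=0.1, gamma=1.0, s=0.1):
    """One AGD step with a negative-curvature-exploitation (NCE) fallback."""
    # "Too nonconvex" along v: f drops below its linear model by more than a quadratic,
    # which signals significant negative curvature in the direction v.
    too_nonconvex = (
        np.linalg.norm(v) > 0
        and f(x + v) < f(x) + grad_f(x) @ v - 0.5 * gamma * np.linalg.norm(v) ** 2
    )
    if not too_nonconvex:
        y = x + (1.0 - theta) * v          # usual AGD step at the look-ahead point
        x_next = y - eta * grad_f(y)
        return x_next, x_next - x
    if np.linalg.norm(v) >= s:             # NCE, large velocity: reset the velocity
        return x, np.zeros_like(v)
    # NCE, small velocity: move to whichever of x + v, x - v has the smaller value.
    x_next = min((x + v, x - v), key=f)
    return x_next, np.zeros_like(v)
```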

Improve or localize, AGD version:

f(x_{t+1}) + (1/(2η))‖v_{t+1}‖² ≤ f(x_t) + (1/(2η))‖v_t‖² − (θ/(2η))‖v_t‖²,
so Σ_{t=0}^{T−1} ‖x_{t+1} − x_t‖² ≤ (2η/θ) (f(x_0) − f(x_T)).

Approximate f locally by a quadratic and carry out the computations; the precise computations are technically challenging.

Summary:
- Simple variations of GD/AGD ensure efficient escape from saddle points.
- Fine-grained understanding of the geometric structure around saddle points.
- Novel techniques of independent interest.
- Some extensions to the stochastic setting.

Open questions:
- Is NCE really necessary?
- Lower bounds: recent work by Carmon et al. 2017, but there remain gaps between the upper and lower bounds.
- Extensions to the stochastic setting.
- Nonconvex optimization for faster algorithms.

Thank you! Questions?