Optimization for Machine Learning (Lecture 3-B - Nonconvex)

Optimization for Machine Learning (Lecture 3-B - Nonconvex)
SUVRIT SRA, Massachusetts Institute of Technology
Machine Learning Summer School, MPI-IS Tübingen, June 2017
ml.mit.edu

Nonconvex problems are hard
Nonconvex optimization problem with simple constraints:
min_z  (Σ_i a_i z_i - s)² + Σ_i z_i (1 - z_i)   s.t.  0 ≤ z_i ≤ 1,  i = 1, ..., n.
Question: Is the global minimum of this problem 0 or not? Answering it decides whether there exists a subset of {a_1, a_2, ..., a_n} that sums to s: the subset-sum problem, a well-known NP-complete problem.
min_x  xᵀAx   s.t.  x ≥ 0.
Question: Is x = 0 a local minimum or not? Deciding this is also hard in general.
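A tiny Python sketch (illustrative, not from the slides) of the encoding above: for a binary z the penalty term vanishes, so the objective is 0 exactly when the chosen subset sums to s. The specific numbers are made up.

import numpy as np

a = np.array([3.0, 5.0, 8.0, 13.0])
s = 16.0                      # 3 + 13 = 16, so a subset summing to s exists

def objective(z):
    # (sum_i a_i z_i - s)^2 + sum_i z_i (1 - z_i), with 0 <= z_i <= 1
    return (a @ z - s) ** 2 + np.sum(z * (1.0 - z))

print(objective(np.array([1.0, 0.0, 0.0, 1.0])))   # 0.0: certifies the subset {3, 13}
print(objective(np.array([0.5, 0.5, 0.5, 0.5])))   # > 0: fractional z is penalized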

Nonconvex finite-sum problems
min_{θ ∈ R^d}  (1/n) Σ_{i=1}^n ℓ(y_i, DNN(x_i; θ)) + Ω(θ)
[Figure: a deep network F(x; θ) mapping inputs x_1, x_2, ..., x_p to its output.]

Nonconvex finite-sum problems
min_{θ ∈ R^d}  g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
Related work:
- Original SGD paper (Robbins, Monro, 1951): asymptotic convergence; no rates.
- SGD with scaled gradients, θ_{t+1} = θ_t - η_t H_t ∇f(θ_t), plus other tricks: space dilation (Shor, 1972); variable-metric SGD (Uryasev, 1988); AdaGrad (Duchi, Hazan, Singer, 2012); Adam (Kingma, Ba, 2015); and many others (typically asymptotic convergence for nonconvex problems).
- A large number of other ideas, often for step-size tuning and initialization (see, e.g., the blog post by S. Ruder on gradient descent algorithms).
Goal: going beyond SGD (theoretically; ultimately in practice too).

Nonconvex finite-sum problems
min_{θ ∈ R^d}  g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
Related work (subset):
- (Solodov, 1997) Incremental gradient, smooth nonconvex (asymptotic convergence; no rates proved)
- (Bertsekas, Tsitsiklis, 2000) Gradient descent with errors; incremental (see Section 2.4 of Nonlinear Programming; no rates proved)
- (Sra, 2012) Incremental nonconvex non-smooth (asymptotic convergence only)
- (Ghadimi, Lan, 2013) SGD for nonconvex stochastic optimization (first non-asymptotic rates to stationarity)
- (Ghadimi et al., 2013) SGD for nonconvex non-smooth stochastic optimization (non-asymptotic rates, but key limitations)

Difficulty of nonconvex optimization
[Figure: a one-dimensional nonconvex function with multiple local minima, maxima, and saddle points.]
Difficult to optimize, but ∇g(θ) = 0 is only a necessary condition: local minima, maxima, and saddle points all satisfy it. So, try to see how fast we can satisfy this necessary condition.

Measuring efficiency of nonconvex optimization
Convex: E[g(θ_t) - g*] ≤ ε (optimality gap).
Nonconvex: E[||∇g(θ_t)||²] ≤ ε (stationarity gap). (Nesterov, 2003, Chap. 1); (Ghadimi, Lan, 2012)
Incremental First-order Oracle (IFO): (x, i) ↦ (f_i(x), ∇f_i(x)). (Agarwal, Bottou, 2014); see also (Nemirovski, Yudin, 1983).
Measure: number of IFO calls needed to attain ε accuracy.

IFO example: SGD vs GD (nonconvex)
min_{θ ∈ R^d}  g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
SGD: θ_{t+1} = θ_t - η_t ∇f_{i_t}(θ_t). O(1) IFO calls per iteration, O(1/ε²) iterations, so O(1/ε²) IFO calls in total, independent of n. (Ghadimi, Lan, 2013, 2014)
GD: θ_{t+1} = θ_t - η_t ∇g(θ_t). O(n) IFO calls per iteration, O(1/ε) iterations, so O(n/ε) IFO calls in total, which depends strongly on n. (Nesterov, 2003; Nesterov, 2012)
Both assume Lipschitz gradients, with accuracy measured as E[||∇g(θ_t)||²] ≤ ε.
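As a concrete illustration of IFO accounting (not from the slides), the following minimal Python sketch runs SGD and full-batch GD on a toy nonconvex finite sum and counts oracle calls; the double-well components f_i, step sizes, and iteration counts are made-up assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))

def ifo(theta, i, counter):
    # Incremental first-order oracle: returns (f_i(theta), grad f_i(theta)) for the
    # toy nonconvex component f_i(theta) = 0.25*(a_i^T theta)^4 - 0.5*(a_i^T theta)^2.
    counter[0] += 1
    s = A[i] @ theta
    return 0.25 * s**4 - 0.5 * s**2, (s**3 - s) * A[i]

def sgd(theta, eta=1e-3, iters=5000):
    calls = [0]
    for _ in range(iters):
        i = rng.integers(n)
        _, gi = ifo(theta, i, calls)                             # O(1) IFO calls per iteration
        theta = theta - eta * gi
    return theta, calls[0]

def gd(theta, eta=1e-3, iters=200):
    calls = [0]
    for _ in range(iters):
        grads = [ifo(theta, i, calls)[1] for i in range(n)]      # O(n) IFO calls per iteration
        theta = theta - eta * np.mean(grads, axis=0)
    return theta, calls[0]

theta0 = 0.1 * rng.normal(size=d)
print("SGD IFO calls:", sgd(theta0.copy())[1])
print("GD  IFO calls:", gd(theta0.copy())[1])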

Nonconvex finite-sum problems
min_{θ ∈ R^d}  g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
SGD: θ_{t+1} = θ_t - η_t ∇f_{i_t}(θ_t)        GD: θ_{t+1} = θ_t - η_t ∇g(θ_t)
For convex finite sums, SAG, SVRG, SAGA, et al. improve on both. Do these benefits extend to nonconvex finite sums? Their analysis depends heavily on convexity (especially for controlling the variance).

SVRG/SAGA work again! (with new analysis)

Nonconvex SVRG
for s = 0, 1, ..., S-1:
    θ_0^{s+1} = θ_m^s,  θ̃^s = θ_m^s
    for t = 0, 1, ..., m-1:
        uniformly randomly pick i(t) ∈ {1, ..., n}
        θ_{t+1}^{s+1} = θ_t^{s+1} - η_t [ ∇f_{i(t)}(θ_t^{s+1}) - ∇f_{i(t)}(θ̃^s) + (1/n) Σ_{i=1}^n ∇f_i(θ̃^s) ]
The same algorithm as the usual SVRG (Johnson, Zhang, 2013). The last two terms inside the brackets form a zero-mean correction δ_t (E[δ_t] = 0); the sum (1/n) Σ_i ∇f_i(θ̃^s) is the full gradient at the snapshot, computed once every epoch. The step size η and the inner-loop length m are the key quantities that determine how the method operates.

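A minimal Python sketch of this nonconvex SVRG loop follows (illustrative only; the toy double-well objective, step size η, epoch length m, and number of epochs are assumptions, not the tuned choices from the cited analyses).

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
A = rng.normal(size=(n, d))

def grad_fi(theta, i):
    # Gradient of the toy nonconvex component f_i(theta) = 0.25*(a_i^T theta)^4 - 0.5*(a_i^T theta)^2
    s = A[i] @ theta
    return (s**3 - s) * A[i]

def full_grad(theta):
    s = A @ theta
    return ((s**3 - s)[:, None] * A).mean(axis=0)

def nonconvex_svrg(theta, eta=0.005, m=500, S=20):
    for _ in range(S):
        snapshot = theta.copy()
        g_snap = full_grad(snapshot)      # full gradient, computed once per epoch
        for _ in range(m):
            i = rng.integers(n)
            v = grad_fi(theta, i) - grad_fi(snapshot, i) + g_snap   # variance-reduced gradient estimate
            theta = theta - eta * v
    return theta

theta = nonconvex_svrg(0.1 * rng.normal(size=d))
print("stationarity ||grad g(theta)||:", np.linalg.norm(full_grad(theta)))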

Key ideas for the analysis of nc-SVRG
Previous SVRG proofs rely on convexity to control the variance.
Larger step size → smaller inner loop (the full-gradient computation dominates the epoch); smaller step size → slower convergence (longer inner loop).
(Carefully) trading off the number of inner-loop iterations m against the step size η leads to fewer IFO calls! (Reddi, Hefny, Sra, Poczos, Smola, 2016; Allen-Zhu, Hazan, 2016)

Faster nonconvex optimization via VR (Reddi, Hefny, Sra, Poczos, Smola, 2016; Reddi et al., 2016)
IFO complexity to reach E[||∇g(θ_t)||²] ≤ ε for Lipschitz-smooth nonconvex finite sums:
Algorithm   IFO calls
SGD         O(1/ε²)
GD          O(n/ε)
SVRG        O(n + n^{2/3}/ε)
SAGA        O(n + n^{2/3}/ε)
MSVRG       O(min{1/ε², n^{2/3}/ε})
Remarks: new results for the convex case too; additional nonconvex results. For related results, see also (Allen-Zhu, Hazan, 2016).

Linear rates for nonconvex problems
min_{θ ∈ R^d}  g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
The Polyak-Łojasiewicz (PL) class of functions:  g(θ) - g* ≤ (1/(2μ)) ||∇g(θ)||²   (Polyak, 1963); (Łojasiewicz, 1963)
Examples: μ-strong convexity implies PL; stochastic PCA and some large-scale eigenvector problems. (More general than many other uses of restricted strong convexity.)
(Karimi, Nutini, Schmidt, 2016) proximal extensions; references
(Attouch, Bolte, 2009) the more general Kurdyka-Łojasiewicz class
(Bertsekas, 2016) textbook; more growth conditions
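As a quick check of the claim that μ-strong convexity implies PL (a standard one-line argument, added here for completeness): strong convexity gives g(y) ≥ g(θ) + ⟨∇g(θ), y - θ⟩ + (μ/2)||y - θ||² for all y; minimizing both sides over y yields
g* ≥ g(θ) - (1/(2μ)) ||∇g(θ)||²,
which rearranges to the PL inequality above.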

Linear rates for nonconvex problems
Under the PL condition g(θ) - g* ≤ (1/(2μ)) ||∇g(θ)||², measure E[g(θ_t) - g*] ≤ ε.
Algorithm   Nonconvex                  Nonconvex-PL
SGD         O(1/ε²)                    O(1/ε²)
GD          O(n/ε)                     O((n/(2μ)) log(1/ε))
SVRG        O(n + n^{2/3}/ε)           O((n + n^{2/3}/(2μ)) log(1/ε))
SAGA        O(n + n^{2/3}/ε)           O((n + n^{2/3}/(2μ)) log(1/ε))
MSVRG       O(min{1/ε², n^{2/3}/ε})    -
A variant of nc-SVRG attains this fast convergence! (Reddi, Hefny, Sra, Poczos, Smola, 2016; Reddi et al., 2016)

Empirical results
CIFAR-10 dataset; 2-layer neural network.
[Figure: ||∇f(θ_t)||² during training.]

Some surprises!
min_{θ ∈ R^d}  (1/n) Σ_{i=1}^n f_i(θ) + Ω(θ)
Ω: a regularizer, e.g., ||θ||_1 for enforcing sparsity of weights (in a neural net, or more generally), or the indicator function of a constraint set, etc.

Nonconvex composite objective problems
min_{θ ∈ R^d}  (1/n) Σ_{i=1}^n f_i(θ) + Ω(θ),   with f_i nonconvex and Ω convex
Prox-SGD: θ_{t+1} = prox_{η_t Ω}(θ_t - η_t ∇f_{i_t}(θ_t)),   where prox_Ω(v) := argmin_u (1/2)||u - v||² + Ω(u).
The prox is soft-thresholding for ||·||_1 and projection for an indicator function.
Convergence of Prox-SGD is not known!* Partial results: (Ghadimi, Lan, Zhang, 2014), using growing minibatches and shrinking step sizes.
* Except in special cases (where even rates may be available).
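A small Python sketch of the Prox-SGD update with an ℓ1 regularizer, where the prox is soft-thresholding. The smooth components below are convex least-squares stand-ins and the step size is arbitrary; they are placeholders for the nonconvex f_i discussed above.

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 300, 20, 0.1
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_fi(theta, i):
    # Stochastic gradient of the smooth part (toy least-squares component)
    return (A[i] @ theta - b[i]) * A[i]

def prox_l1(v, t):
    # prox of t*lam*||.||_1: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)

def prox_sgd(theta, eta=0.01, iters=5000):
    for _ in range(iters):
        i = rng.integers(n)
        theta = prox_l1(theta - eta * grad_fi(theta, i), eta)
    return theta

theta = prox_sgd(np.zeros(d))
print("nonzero coordinates:", np.count_nonzero(theta))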

Empirical results: NN-PCA
min_{||w|| ≤ 1, w ≥ 0}  -(1/2) wᵀ (Σ_{i=1}^n x_i x_iᵀ) w
[Figure: f(x_t) - f(x̂), the distance to an approximate optimum, versus #grad/n for SGD, SAGA, and SVRG on covtype (581,012 samples, 54 features) and rcv1 (677,399 samples, 47,236 features).]
Eigenvectors via SGD: (Oja, Karhunen, 1985); via SVRG: (Shamir, 2015, 2016); (Garber, Hazan, Jin, Kakade, Musco, Netrapalli, Sidford, 2016); and many more!

Finite-sum problems with nonconvex g(θ) and parameters θ lying on a known manifold:
min_{θ ∈ M}  g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
Examples: eigenvector problems (the ||θ|| = 1 constraint); problems with orthogonality constraints; low-rank matrices; positive definite matrices / covariances.

Nonconvex optimization on manifolds (Zhang, Reddi, Sra, 2016)
min_{θ ∈ M}  g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
Related work:
- (Udriste, 1994) batch methods; textbook
- (Edelman, Arias, Smith, 1999) classic paper; orthogonality constraints
- (Absil, Mahony, Sepulchre, 2009) textbook; convergence analysis
- (Boumal, 2014) PhD thesis; algorithms, theory, examples
- (Mishra, 2014) PhD thesis; algorithms, theory, examples
- manopt: excellent MATLAB toolbox
- (Bonnabel, 2013) Riemannian SGD, asymptotic convergence
- and many more!
Exploiting manifold structure yields speedups.
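To make the manifold setting concrete, here is a minimal Riemannian SGD sketch on the unit sphere for a leading-eigenvector problem (not from the slides): the Riemannian gradient is obtained by projecting the Euclidean gradient onto the tangent space, and the retraction is simple renormalization. The data, step size, and iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 30
X = rng.normal(size=(n, d))

def riemannian_sgd_sphere(theta, eta=0.01, iters=20000):
    theta = theta / np.linalg.norm(theta)
    for _ in range(iters):
        i = rng.integers(n)
        g = -(X[i] @ theta) * X[i]              # Euclidean gradient of f_i(theta) = -0.5*(x_i^T theta)^2
        rg = g - (g @ theta) * theta            # Riemannian gradient: project onto the tangent space at theta
        theta = theta - eta * rg                # take a step in the tangent direction
        theta = theta / np.linalg.norm(theta)   # retraction: map back onto the sphere
    return theta

theta = riemannian_sgd_sphere(rng.normal(size=d))
C = X.T @ X / n
print("Rayleigh quotient:", theta @ C @ theta, "  top eigenvalue:", np.linalg.eigvalsh(C)[-1])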

Example: Gaussian Mixture Model
p_mix(x) := Σ_{k=1}^K π_k p_N(x; μ_k, Σ_k)
Maximum likelihood: max Π_i p_mix(x_i).
Numerical challenge: the positive definite constraint on Σ_k. Approaches: EM; an unconstrained Cholesky parameterization Σ_k = L Lᵀ; or (new) Riemannian optimization directly on the manifold S_+^d of positive definite matrices. [Hosseini, Sra, 2015]

Careful use of manifold geometry helps!
Riemannian LBFGS (careful implementation) vs EM on an images dataset, d = 35, n = 200,000 (time to convergence and final objective value):
K     EM               R-LBFGS
2     17 s,  29.28     14 s,  29.28
5     202 s, 32.07     117 s, 32.07
10    2159 s, 33.05    658 s, 33.06
github.com/utvisionlab/mixest

Large-scale Gaussian mixture models!
[Figure: averaged cost difference versus iterations (0-200) for SGD (it=5), SGD (it=20), SGD (it=50), EM, LBFGS, and CG.]
Riemannian SGD for GMMs (d = 90, n = 515,345).

Larger-scale optimization
min_{θ ∈ R^d}  g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

Simplest setting: using mini-batches
Idea: use b stochastic gradients (IFO calls) per iteration. Useful in parallel and distributed settings; increases parallelism and reduces communication.
Mini-batch SGD: θ_{t+1} = θ_t - (η_t / |I_t|) Σ_{j ∈ I_t} ∇f_j(θ_t)
For batch size b, SGD takes a factor 1/√b fewer iterations (Dekel, Gilad-Bachrach, Shamir, Xiao, 2012). For batch size b, SVRG takes a factor 1/b fewer iterations: a theoretical linear speedup with parallelism. See also S2GD (convex case): (Konečný, Liu, Richtárik, Takáč, 2015).
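A quick numerical check (not from the slides) of why mini-batches help: the mini-batch gradient's mean squared deviation from the full gradient shrinks roughly like 1/b. The quadratic toy objective below is an assumption made for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 20
A = rng.normal(size=(n, d))
theta = rng.normal(size=d)

full = A.T @ (A @ theta) / n        # full gradient of g(theta) = 0.5 * mean_i (a_i^T theta)^2

def minibatch_grad(b):
    idx = rng.choice(n, size=b, replace=False)
    return A[idx].T @ (A[idx] @ theta) / b

for b in (1, 4, 16, 64):
    errs = [np.linalg.norm(minibatch_grad(b) - full) ** 2 for _ in range(500)]
    print(f"b = {b:3d}   mean squared deviation from the full gradient: {np.mean(errs):.3f}")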

Asynchronous stochastic algorithms
SGD: θ_{t+1} = θ_t - (η_t / |I_t|) Σ_{j ∈ I_t} ∇f_j(θ_t)
SGD is an inherently sequential algorithm, so it suffers slow-downs in parallel/distributed settings (synchronization). Classic results on asynchronous optimization: (Bertsekas, Tsitsiklis, 1987).
Asynchronous SGD implementation (HogWild!): avoids the need to synchronize and operates in a lock-free manner. Key assumption: sparse data (often true in ML). But it is still SGD, and thus has slow sublinear convergence even for strongly convex functions.

Asynchronous algorithms: parallel
Does variance reduction work with asynchrony? Yes!
- ASVRG (Reddi, Hefny, Sra, Poczos, Smola, 2015)
- ASAGA (Leblond, Pedregosa, Lacoste-Julien, 2016)
- Perturbed iterate analysis (Mania et al., 2016)
A few subtleties are involved, some gaps between theory and practice remain, and these methods are more complex than async-SGD.
Bottom line: on sparse data, one can get an almost linear speedup from parallelism (π machines lead to roughly a π-times speedup).

Asynchronous algorithms: distributed
Common parameter-server architecture (Li, Andersen, Smola, Yu, 2014); classic reference: (Bertsekas, Tsitsiklis, 1987).
D-SGD: workers compute (stochastic) gradients; the server computes the parameter update. A widely used (centralized) design choice, but it can have quite high communication cost.
Asynchrony via servers using delayed / stale gradients from workers: (Nedic, Bertsekas, Borkar, 2000); (Agarwal, Duchi, 2011); and many others. (Shamir, Srebro, 2014) give a nice overview of distributed stochastic optimization.

Asynchronous algorithms: distributed
To reduce communication, the following idea is useful:
[Figure: data shards D1, D2, ..., Dm held by workers W1, W2, ..., Wm, which communicate with servers S1, ..., Sk.]
Worker nodes solve compute-intensive subproblems; servers perform simple aggregation (e.g., full gradients for distributed SVRG). DANE (Shamir, Srebro, Zhang, 2013): distributed Newton; can be viewed as having an SVRG-like gradient correction.

Asynchronous algorithms: distributed
Key point: use SVRG (or a related fast method) to solve suitable subproblems at the workers and reduce the number of communication rounds (or just do D-SVRG).
Some related work:
- (Lee, Lin, Ma, Yang, 2015) D-SVRG, and an accelerated version for some special cases (applies in a smaller condition-number regime)
- (Ma, Smith, Jaggi, Jordan, Richtárik, Takáč, 2015) CoCoA+ (updates m local dual variables using m local data points; any local optimization method can be used); higher runtime + communication
- (Shamir, 2016) D-SVRG via a cool application of without-replacement SVRG! Regularized least-squares problems only, for now.
- Several more: DANE, DISCO, AIDE, etc.

Summary
- VR stochastic methods for nonconvex problems
- Surprises in the proximal setup
- Nonconvex problems on manifolds
- Large-scale: parallel + sparse data
- Large-scale: distributed; SVRG benefits and limitations
If there is a finite-sum structure, VR ideas can be used!

Perspectives: did not cover these!
- Stochastic quasi-convex optimization (Hazan, Levy, Shalev-Shwartz, 2015)
- Nonlinear eigenvalue-type problems (Belkin, Rademacher, Voss, 2016)
- Frank-Wolfe + SVRG (Reddi, Sra, Poczos, Smola, 2016)
- Newton-type methods: (Carmon, Duchi, Hinder, Sidford, 2016); (Agarwal, Allen-Zhu, Bullins, Hazan, Ma, 2016); and many more
- Also: robust optimization, infinite-dimensional nonconvex problems, geodesic convexity for global optimality, polynomial optimization, and much more. It's a rich field!

Perspectives
- Impact of non-convexity on generalization
- Non-separable problems (e.g., maximizing AUC); saddle-point problems; robust optimization; heavy tails
- Convergence theory, local and global
- Lower bounds for nonconvex finite sums
- Distributed algorithms (theory and implementations)
- New applications (e.g., of Riemannian optimization)
- Search for other, more tractable nonconvex models
- Specialization to deep networks; software toolkits