Linear Convergence under the Polyak-Łojasiewicz Inequality

Hamed Karimi, Julie Nutini, Mark Schmidt (University of British Columbia)

Linear Convergence of Gradient-Based Methods

Fitting most machine learning models involves optimization. The most common algorithm is gradient descent (GD) and its variants: stochastic gradient, quasi-Newton, coordinate descent, and so on. The standard global convergence rate result for GD is

Smoothness + Strong Convexity ⇒ Linear Convergence,

meaning the error on iteration t is $O(\rho^t)$. But even simple models are often not strongly convex: least squares, logistic regression, etc.

This talk: how much can we relax strong convexity?

Smoothness + ??? ⇒ Linear Convergence

Polyak-Łojasiewicz (PL) Inequality

Polyak [1963] showed linear convergence of GD assuming

$\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu\,(f(x) - f^*),$

that is, that the gradient grows as a quadratic function of the sub-optimality. This holds for strongly-convex problems, but also for problems of the form $f(x) = g(Ax)$ for strongly-convex $g$, which includes least squares, logistic regression (on a compact set), etc. It is a special case of the Łojasiewicz inequality [1963], so we'll call it the Polyak-Łojasiewicz (PL) inequality. Using the PL inequality we can show

Smoothness + PL Inequality ⇒ Linear Convergence.

Linear Convergence of GD under the PL Inequality

Consider basic unconstrained smooth optimization,

$\operatorname{argmin}_{x \in \mathbb{R}^d} f(x),$

where $f$ satisfies the PL inequality and $\nabla f$ is Lipschitz continuous,

$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2.$

Applying GD with a constant step size of $1/L$, $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$, we have

$f(x_{k+1}) \leq f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 = f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2 \leq f(x_k) - \frac{\mu}{L}\,[f(x_k) - f^*].$

Subtracting $f^*$ and applying this recursively gives the global linear rate

$f(x_k) - f^* \leq \left(1 - \frac{\mu}{L}\right)^k [f(x_0) - f^*].$
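
As a sanity check on this rate, here is a minimal numerical sketch (our own, not from the talk) of GD with step size 1/L on a synthetic least-squares problem whose design matrix is rank-deficient, so f satisfies PL (it has the form g(Ax) with g strongly convex) but is not strongly convex. The suboptimality gap still shrinks by a roughly constant factor per iteration:

import numpy as np

# f(x) = 0.5*||Ax - b||^2 with a rank-deficient A: f satisfies PL
# (f = g(Ax) with g strongly convex) but is NOT strongly convex,
# since A has a nontrivial null space.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10)) @ np.diag([1.0] * 5 + [0.0] * 5)  # rank 5
b = rng.standard_normal(50)

def f(x):
    r = A @ x - b
    return 0.5 * r @ r

L = np.linalg.eigvalsh(A.T @ A).max()  # Lipschitz constant of the gradient
f_star = f(np.linalg.pinv(A) @ b)      # optimal value

x = np.zeros(10)
gap = [f(x) - f_star]
for k in range(30):
    x -= (A.T @ (A @ x - b)) / L       # x_{k+1} = x_k - (1/L) grad f(x_k)
    gap.append(f(x) - f_star)
print(gap[10] / gap[9], gap[30] / gap[29])  # roughly constant ratio below 1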

Linear Convergence under the PL Inequality

The proof is simple (simpler than with SC). It does not require uniqueness of the solution (unlike SC), and it does not imply convexity (unlike SC).

Weaker Conditions than Strong Convexity (SC)

How does the PL inequality [1963] relate to more recent conditions?

- EB: error bounds [Luo and Tseng, 1993].
- QG: quadratic growth [Anitescu, 2000].
- ESC: essential strong convexity [Liu et al., 2013].
- RSI: restricted secant inequality [Zhang and Yin, 2013]. RSI plus convexity is restricted strong convexity.
- Semi-strong convexity [Gong and Ye, 2014]. Equivalent to QG plus convexity.
- Optimal strong convexity [Liu and Wright, 2015]. Equivalent to QG plus convexity.
- WSC: weak strong convexity [Necoara et al., 2015].

Proofs are often more complicated under these conditions. Are they more general?

Relationships Between Conditions

For a function $f$ with a Lipschitz-continuous gradient, we have

(SC) ⇒ (ESC) ⇒ (WSC) ⇒ (RSI) ⇒ (EB) ≡ (PL) ⇒ (QG).

If we further assume that $f$ is convex, then

(RSI) ≡ (EB) ≡ (PL) ≡ (QG).

QG is the weakest condition, but it allows non-global local minima. PL and EB are the most general conditions that still guarantee a global minimum, so the more recent conditions are not needed.

PL Inequality and Invexity

While PL doesn't imply convexity, it does imply invexity. For smooth $f$, invexity is equivalent to all stationary points being global optima. An example of an invex but non-convex function satisfying PL is

$f(x) = x^2 + 3\sin^2(x).$

Maybe "strong invexity" would be a better name?
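
As a quick numerical check (our own sketch, not from the talk), we can lower-bound the PL ratio for this function on a grid, using that its global minimum is f* = 0 at x = 0:

import numpy as np

# Estimate the PL constant of f(x) = x^2 + 3*sin^2(x): the ratio
# 0.5*f'(x)^2 / (f(x) - f*) should be bounded away from zero.
f = lambda x: x**2 + 3.0 * np.sin(x)**2
fp = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)  # f'(x)

xs = np.linspace(-10.0, 10.0, 200001)
xs = xs[f(xs) > 1e-12]                   # avoid 0/0 at the minimizer x = 0
print((0.5 * fp(xs)**2 / f(xs)).min())   # positive, so PL holds on the grid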

PL Inequality and Non-Convex Functions

Many important models don't satisfy invexity. For these problems we often divide the analysis into two phases:

- Global convergence: the iterations needed to get close to a minimizer.
- Local convergence: how fast the method converges near the minimizer.

Usually the local convergence analysis assumes SC near the minimizer. If we instead assume PL, the local convergence phase may begin much earlier. (The slide shows a plot of a function for which this holds on all of R, even though the function is non-convex on [-3, 3].)

Convergence of Huge-Scale Methods

For large datasets we typically don't use GD, but the PL inequality can be used to analyze other algorithms. We'll use PL for coordinate descent and stochastic gradient. Garber and Hazan [2015] consider Frank-Wolfe, Reddi et al. [2016] consider other stochastic algorithms, and in Karimi et al. [2016] we consider sign-based gradient methods.

Random and Greedy Coordinate Descent

For randomized coordinate descent under PL we have

$\mathbb{E}[f(x_k) - f^*] \leq \left(1 - \frac{\mu}{d L_c}\right)^k [f(x_0) - f^*],$

where $L_c$ is the coordinate-wise Lipschitz constant of $\nabla f$. This is faster than the GD rate if iterations are $d$ times cheaper. For greedy coordinate descent under PL we have the faster rate

$f(x_k) - f^* \leq \left(1 - \frac{\mu_1}{L_c}\right)^k [f(x_0) - f^*],$

where $\mu_1$ is the PL constant in the $L_\infty$-norm,

$\|\nabla f(x)\|_\infty^2 \geq 2\mu_1 (f(x) - f^*).$

This gives a rate for some boosting variants [Meir and Rätsch, 2003].
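
A minimal randomized coordinate descent sketch (our own, on an assumed synthetic quadratic) matching the setting above: each iteration updates one uniformly-sampled coordinate with step size 1/L_c:

import numpy as np

# Randomized coordinate descent on f(x) = 0.5*x'Hx - b'x.
rng = np.random.default_rng(1)
d = 10
M = rng.standard_normal((d, d))
H = M.T @ M / d + np.eye(d)       # well-conditioned PSD Hessian for the demo
b = rng.standard_normal(d)
Lc = H.diagonal().max()           # coordinate-wise Lipschitz constant

f = lambda x: 0.5 * x @ H @ x - b @ x
f_star = f(np.linalg.solve(H, b))

x = np.zeros(d)
for k in range(1000):
    i = rng.integers(d)             # sample a coordinate uniformly at random
    x[i] -= (H[i] @ x - b[i]) / Lc  # step along the i-th partial derivative
print(f(x) - f_star)                # shrinks linearly (in expectation)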

Stochastic Gradient Methods

Stochastic gradient (SG) methods apply to the general problem

$\operatorname{argmin}_{x \in \mathbb{R}^d} f(x) = \mathbb{E}[f_i(x)],$

and we usually focus on the special case of a finite sum,

$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x).$

SG methods use the iteration

$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k),$

where $\nabla f_{i_k}$ is an unbiased gradient approximation.

Stochastic Gradient Methods

With $\alpha_k = \frac{2k+1}{2\mu(k+1)^2}$, the SG method satisfies

$\mathbb{E}[f(x_k) - f^*] \leq \frac{L\sigma^2}{2k\mu^2},$

while with $\alpha_k$ set to a constant $\alpha$ we have

$\mathbb{E}[f(x_k) - f^*] \leq (1 - 2\mu\alpha)^k [f(x_0) - f^*] + \frac{L\sigma^2\alpha}{4\mu}.$

This is an O(1/k) rate without strong convexity (or even convexity), and a fast reduction of sub-optimality under a small constant step size. Our work and Reddi et al. [2016] consider the finite-sum case: we analyze the stochastic variance-reduced gradient (SVRG) method and obtain linear convergence rates.
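
Here is a toy sketch (our own) of the decreasing-step-size result on a synthetic finite-sum least-squares problem; mu is taken as the smallest Hessian eigenvalue, and we cap the first few steps at 1/L_max for stability (a practical tweak, not part of the theorem):

import numpy as np

# SG with alpha_k = (2k+1)/(2*mu*(k+1)^2) on f(x) = (1/n) sum_i 0.5*(a_i'x - b_i)^2.
rng = np.random.default_rng(2)
n, d = 1000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

mu = np.linalg.eigvalsh(A.T @ A / n).min()     # PL constant of this quadratic
Lmax = (np.linalg.norm(A, axis=1) ** 2).max()  # per-example smoothness

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

x = np.zeros(d)
for k in range(50000):
    i = rng.integers(n)
    g = (A[i] @ x - b[i]) * A[i]               # unbiased gradient estimate
    alpha = (2 * k + 1) / (2 * mu * (k + 1) ** 2)
    x -= min(alpha, 1.0 / Lmax) * g            # cap early steps (practical tweak)
print(f(x) - f_star)                           # O(1/k) suboptimality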

PL Generalizations for Non-Smooth Problems

What can we say about non-smooth problems? A well-known generalization of PL is the KL inequality, and Attouch and Bolte [2009] show a linear rate for the proximal-point method under it. But proximal-gradient methods are more relevant for ML, and the KL inequality has only been used to show a local rate for this method. We propose a different PL generalization that gives a simple global rate.

Proximal-PL Inequality

Proximal-gradient methods apply to the problem

$\operatorname{argmin}_{x \in \mathbb{R}^d} F(x) = f(x) + g(x),$

where $\nabla f$ is $L$-Lipschitz but $g$ may be non-smooth (for example, L1-regularization or simple convex constraints). We say that $F$ satisfies the proximal-PL inequality if

$\mathcal{D}_g(x, L) \geq 2\mu\,(F(x) - F^*),$

where

$\mathcal{D}_g(x, \alpha) \equiv -2\alpha \min_y \left\{ \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2 + g(y) - g(x) \right\}.$

The condition is ugly, but it yields an extremely simple proof:

$F(x_{k+1}) = f(x_{k+1}) + g(x_k) + g(x_{k+1}) - g(x_k)$
$\leq F(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 + g(x_{k+1}) - g(x_k)$
$\leq F(x_k) - \frac{1}{2L}\mathcal{D}_g(x_k, L)$
$\leq F(x_k) - \frac{\mu}{L}[F(x_k) - F^*],$

which gives

$F(x_k) - F^* \leq \left(1 - \frac{\mu}{L}\right)^k [F(x_0) - F^*].$
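
To make the algorithm concrete: below is a minimal proximal-gradient (ISTA) sketch (our own) for the LASSO, one of the proximal-PL problems listed on the next slide; the prox step for lam*||.||_1 is soft-thresholding:

import numpy as np

# Proximal gradient (ISTA) for F(x) = 0.5*||Ax - b||^2 + lam*||x||_1.
rng = np.random.default_rng(3)
n, d = 100, 50
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
lam = 0.1
L = np.linalg.eigvalsh(A.T @ A).max()    # Lipschitz constant of grad f

def soft_threshold(z, t):
    # proximal operator of t*||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

x = np.zeros(d)
for k in range(500):
    x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)  # prox-gradient step
print(F(x))  # decreases to F* at a linear rate under proximal-PL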

Relevant Problems for Proximal-PL

The proximal-PL inequality is satisfied when:

- $f$ is strongly convex.
- $f$ satisfies PL and $g$ is constant.
- $f = h(Ax)$ for strongly-convex $h$, and $g$ is the indicator of a polyhedral set.
- $F$ is convex and satisfies QG.
- $F$ satisfies the KL inequality (we've shown that proximal-PL and the KL inequality are equivalent).

This includes the dual support vector machine (SVM) problem, and the L1-regularized least squares (LASSO) problem with no need for RIP, homotopy, modified restricted strong convexity, and so on. We also analyze proximal coordinate descent under PL, which implies a linear rate of SDCA for SVMs. Reddi et al. [2016] analyze proximal-SVRG and proximal-SAGA.

Summary

In 1963, Polyak proposed a condition giving a linear rate for GD. It admits a trivial proof and is weaker than more recent conditions. We can use the inequality to analyze huge-scale methods: coordinate descent, stochastic gradient, SVRG, etc. We also give a proximal-gradient generalization under which standard algorithms have a linear rate for the SVM and LASSO problems. Thank you for the invitation!