Linear Convergence under the Polyak-Łojasiewicz Inequality


Linear Convergence under the Polyak-Łojasiewicz Inequality
Hamed Karimi, Julie Nutini and Mark Schmidt
The University of British Columbia
LCI Forum, February 28th, 2017

Linear Convergence of Gradient-Based Methods

Fitting most machine learning models involves optimization. The most common algorithm is gradient descent (GD) and its variants: e.g., stochastic gradient, quasi-Newton, coordinate descent, etc.

Standard global convergence rate result for GD:

Smoothness + Strong Convexity $\Rightarrow$ Linear Convergence

Error on iteration $k$ is $O(\rho^k)$.

But even simple models are often not strongly convex: e.g., least squares, logistic regression, etc.

This talk: how much can we relax strong convexity?

Smoothness + ??? (something weaker than strong convexity) $\Rightarrow$ Linear Convergence

Polyak-Łojasiewicz Inequality

Polyak [1963] showed linear convergence of GD assuming

$$\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu\,(f(x) - f^*),$$

i.e., the gradient grows as a quadratic function of sub-optimality.

This holds for strongly-convex problems, but also for problems of the form $f(x) = g(Ax)$ for strongly-convex $g$. Includes least squares, logistic regression (on a compact set), etc.

It is a special case of the Łojasiewicz inequality [1963]. We call it the Polyak-Łojasiewicz (PL) inequality.

Using the PL inequality, we show

Smoothness + PL Inequality (weaker than strong convexity) $\Rightarrow$ Linear Convergence

Linear Convergence of GD under the PL Inequality

Consider the basic unconstrained smooth optimization problem

$$\min_{x \in \mathbb{R}^d} f(x),$$

where $f$ satisfies the PL inequality and $\nabla f$ is Lipschitz continuous,

$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2.$$

Applying GD with a constant step size of $1/L$,

$$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k),$$

we have

$$f(x^{k+1}) \leq f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{L}{2}\|x^{k+1} - x^k\|^2 = f(x^k) - \frac{1}{2L}\|\nabla f(x^k)\|^2 \leq f(x^k) - \frac{\mu}{L}\left[f(x^k) - f^*\right].$$

Subtracting $f^*$ and applying the bound recursively gives a global linear rate,

$$f(x^k) - f^* \leq \left(1 - \frac{\mu}{L}\right)^k \left[f(x^0) - f^*\right].$$
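To make this concrete, here is a minimal sketch (toy data, all sizes arbitrary) that runs constant-step GD on a rank-deficient least-squares problem. Such a problem is not strongly convex, but it satisfies PL with $\mu$ equal to the smallest nonzero eigenvalue of $A^\top A$, so the observed optimality gap should stay below the $(1 - \mu/L)^k$ bound above.

```python
import numpy as np

# Sketch: GD with step 1/L on f(x) = 0.5*||Ax - b||^2 where A is
# rank-deficient, so f is PL but NOT strongly convex. PL constant
# mu = smallest *nonzero* eigenvalue of A^T A; L = largest eigenvalue.
rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.standard_normal((n, d))
A[:, -1] = A[:, 0]                        # duplicate a column -> A^T A singular
b = rng.standard_normal(n)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

eigs = np.linalg.eigvalsh(A.T @ A)        # ascending order
L = eigs[-1]
mu = min(e for e in eigs if e > 1e-10)    # smallest nonzero eigenvalue
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

x = np.zeros(d)
gap0 = f(x) - f_star
for k in range(1, 201):
    x = x - grad(x) / L                   # constant step size 1/L
    if k % 50 == 0:
        bound = (1 - mu / L) ** k * gap0  # theoretical linear rate
        print(f"k={k:3d}  gap={f(x) - f_star:.3e}  bound={bound:.3e}")
```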

Linear Convergence under the PL Inequality

The proof is simple (simpler than with strong convexity).

It does not require uniqueness of the solution (unlike strong convexity).

It does not imply convexity (unlike strong convexity).

Weaker Conditions than Strong Convexity

How does the PL inequality [1963] relate to more recent conditions?

EB: error bounds [Luo & Tseng, 1993].

QG: quadratic growth [Anitescu, 2000]. QG + convexity is optimal strong convexity [Liu & Wright, 2015].

ESC: essential strong convexity [Liu et al., 2013].

RSI: restricted secant inequality [Zhang & Yin, 2013]. RSI + convexity is restricted strong convexity.

WSC: weak strong convexity [Necoara et al., 2015]. Also sometimes used for QG + convexity.

Proofs are more complicated under these conditions. Are they more general?

Relationships Between Conditions

Theorem. For a function $f$ with a Lipschitz-continuous gradient, we have:

$$\text{(SC)} \rightarrow \text{(ESC)} \rightarrow \text{(WSC)} \rightarrow \text{(RSI)} \rightarrow \text{(EB)} \equiv \text{(PL)} \rightarrow \text{(QG)}.$$

If we further assume that $f$ is convex, then

$$\text{(RSI)} \equiv \text{(EB)} \equiv \text{(PL)} \equiv \text{(QG)}.$$

QG is the weakest condition, but it allows non-global local minima.

PL $\equiv$ EB are the most general conditions that still allow linear convergence to a global minimizer.

PL Inequality for Invex and Non-Convex Functions

While the PL inequality does not imply convexity, it does imply invexity.

For smooth $f$, invexity $\Leftrightarrow$ all stationary points are global optima.

Example of an invex but non-convex function satisfying PL:

$$f(x) = x^2 + 3\sin^2(x).$$

[Figure: plot of $f(x) = x^2 + 3\sin^2(x)$.]

Many important models don't satisfy invexity. For these problems we often divide the analysis into two phases:

Global convergence: iterations needed to get close to a minimizer.

Local convergence: how fast does it converge near the minimizer?

Usually, local convergence assumes strong convexity near the minimizer. If we instead assume PL, then the local convergence phase may begin much earlier.
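A sketch of GD on this non-convex example, taking as given the constants $L = 8$ (from $|f''(x)| = |2 + 6\cos(2x)| \leq 8$) and the PL constant $\mu = 1/32$ reported for this function in the accompanying paper; the run illustrates linear convergence to the global minimum $f(0) = 0$ despite non-convexity.

```python
import numpy as np

# Sketch: GD on the invex, non-convex PL example f(x) = x^2 + 3 sin^2(x).
# f'(x) = 2x + 3 sin(2x); f''(x) = 2 + 6 cos(2x), so L = 8 works.
# Assumption: mu = 1/32 (the PL constant the paper reports for this f).
f = lambda x: x**2 + 3 * np.sin(x) ** 2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)

L, mu, f_star = 8.0, 1.0 / 32.0, 0.0      # global minimum is f(0) = 0
x = 2.0
gap0 = f(x) - f_star
for k in range(1, 501):
    x = x - grad(x) / L                   # constant step size 1/L
    if k % 100 == 0:
        bound = (1 - mu / L) ** k * gap0  # (1 - mu/L)^k rate from the slide
        print(f"k={k:3d}  f(x)-f*={f(x) - f_star:.3e}  bound={bound:.3e}")
```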

Convergence for Huge-Scale Methods

For large datasets, we typically don't use GD. But the PL inequality can be used to analyze other algorithms.

We will use PL for coordinate descent and stochastic gradient. Garber & Hazan [2015] consider Frank-Wolfe. Reddi et al. [2016] consider other stochastic algorithms. In Karimi et al. [2016], we consider sign-based gradient methods.

Random and Greedy Coordinate Descent

For randomized coordinate descent under PL we have

$$\mathbb{E}\left[f(x^k) - f^*\right] \leq \left(1 - \frac{\mu}{d L_c}\right)^k \left[f(x^0) - f^*\right],$$

where $L_c$ is the coordinate-wise Lipschitz constant of $\nabla f$. This is faster than the GD rate if iterations are $d$ times cheaper.

For greedy coordinate descent under PL we have the faster rate

$$f(x^k) - f^* \leq \left(1 - \frac{\mu_1}{L_c}\right)^k \left[f(x^0) - f^*\right],$$

where $\mu_1$ is the PL constant in the $L_\infty$-norm,

$$\frac{1}{2}\|\nabla f(x)\|_\infty^2 \geq \mu_1\,(f(x) - f^*).$$

This gives a rate for some boosting variants [Meir and Rätsch, 2003].
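A toy comparison of the two selection rules on a strongly-convex (hence PL) quadratic; the data, sizes, and iteration counts are arbitrary choices for illustration. The greedy (Gauss-Southwell) rule picks the coordinate with the largest gradient magnitude, matching the $\mu_1$-based rate above.

```python
import numpy as np

# Sketch: randomized vs. greedy (Gauss-Southwell) coordinate descent on
# f(x) = 0.5 x^T Q x - b^T x. Each step updates one coordinate with step
# 1/L_c, where L_c = max_i Q_ii is the coordinate-wise Lipschitz constant.
rng = np.random.default_rng(1)
d = 20
M = rng.standard_normal((d, d))
Q = M.T @ M + np.eye(d)                   # positive definite -> PL holds
b = rng.standard_normal(d)

f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
Lc = Q.diagonal().max()
f_star = f(np.linalg.solve(Q, b))

for rule in ("random", "greedy"):
    x = np.zeros(d)
    for k in range(2000):
        g = grad(x)
        i = rng.integers(d) if rule == "random" else np.argmax(np.abs(g))
        x[i] -= g[i] / Lc                 # single-coordinate gradient step
    print(f"{rule:6s}: f(x)-f* = {f(x) - f_star:.3e}")
```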

Stochastic Gradient Methods

Stochastic gradient (SG) methods apply to the general problem

$$\operatorname*{argmin}_{x \in \mathbb{R}^d} f(x) = \mathbb{E}[f_i(x)],$$

and we usually focus on the special case of a finite sum,

$$f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x).$$

SG methods use the iteration

$$x^{k+1} = x^k - \alpha_k \nabla f_{i_k}(x^k),$$

where $\nabla f_{i_k}(x^k)$ is an unbiased approximation of $\nabla f(x^k)$.

Stochastic Gradient Methods

Theorem. With $\alpha_k = \frac{2k+1}{2\mu(k+1)^2}$, the SG method satisfies

$$\mathbb{E}\left[f(x^k) - f^*\right] \leq \frac{L\sigma^2}{2k\mu^2},$$

while with $\alpha_k$ set to a constant $\alpha$ we have

$$\mathbb{E}\left[f(x^k) - f^*\right] \leq (1 - 2\mu\alpha)^k \left[f(x^0) - f^*\right] + \frac{L\sigma^2\alpha}{4\mu}.$$

This gives an $O(1/k)$ rate without strong convexity (or even convexity), and a fast reduction of sub-optimality under a small constant step size.

Our work and Reddi et al. [2016] consider the finite sum case: we analyze the stochastic variance-reduced gradient (SVRG) method and obtain linear convergence rates.
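A sketch of the two step-size regimes from the theorem on a least-squares finite sum (toy data; as an assumption, the PL constant $\mu$ is taken to be the smallest eigenvalue of $A^\top A / n$, the PL constant of this least-squares objective).

```python
import numpy as np

# Sketch: SG on f(x) = (1/n) sum_i 0.5*(a_i^T x - b_i)^2 with the theorem's
# decreasing schedule alpha_k = (2k+1)/(2 mu (k+1)^2) vs. a small constant
# step. Data and the constant step value are arbitrary illustrations.
rng = np.random.default_rng(2)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])
mu = np.linalg.eigvalsh(A.T @ A / n)[0]    # assumed PL constant

def sgd(step, iters=20000):
    x = np.zeros(d)
    for k in range(iters):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]       # unbiased stochastic gradient
        x = x - step(k) * g
    return f(x) - f_star

print("decreasing:", sgd(lambda k: (2 * k + 1) / (2 * mu * (k + 1) ** 2)))
print("constant  :", sgd(lambda k: 1e-2))
```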

PL Generalizations for Non-Smooth Problems

What can we say about non-smooth problems? A well-known generalization of PL is the Kurdyka-Łojasiewicz (KL) inequality.

Attouch & Bolte [2009] show a linear rate for the proximal-point method. But proximal-gradient methods are more relevant for ML.

The KL inequality has been used to show a local rate for this method. We propose a different PL generalization giving a simple global rate.

Proximal-PL Inequality

Proximal-gradient methods apply to the problem

$$\operatorname*{argmin}_{x \in \mathbb{R}^d} F(x) = f(x) + g(x),$$

where $\nabla f$ is $L$-Lipschitz and $g$ is a potentially non-smooth convex function: e.g., $\ell_1$-regularization, bound constraints, etc.

We say that $F$ satisfies the proximal-PL inequality if

$$\frac{1}{2}\mathcal{D}_g(x, L) \geq \mu\,(F(x) - F^*),$$

where

$$\mathcal{D}_g(x, \alpha) \equiv -2\alpha \min_y \left\{ \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2 + g(y) - g(x) \right\}.$$

This condition yields an extremely simple proof:

$$F(x^{k+1}) = f(x^{k+1}) + g(x^k) + g(x^{k+1}) - g(x^k) \leq F(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{L}{2}\|x^{k+1} - x^k\|^2 + g(x^{k+1}) - g(x^k) \leq F(x^k) - \frac{1}{2L}\mathcal{D}_g(x^k, L) \leq F(x^k) - \frac{\mu}{L}\left[F(x^k) - F^*\right],$$

which gives

$$F(x^k) - F^* \leq \left(1 - \frac{\mu}{L}\right)^k \left[F(x^0) - F^*\right].$$
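A minimal proximal-gradient (ISTA-style) sketch for the LASSO instance of this setup, where the proximal step for $g = \lambda\|\cdot\|_1$ is soft-thresholding; the data and the value of $\lambda$ are arbitrary illustrations.

```python
import numpy as np

# Sketch: proximal gradient on F(x) = 0.5*||Ax - b||^2 + lam*||x||_1.
# The update is x <- prox_{g/L}(x - grad f(x)/L); for the l1 norm the
# prox is soft-thresholding with threshold lam/L.
rng = np.random.default_rng(3)
n, d = 100, 20
x_true = rng.standard_normal(d) * (rng.random(d) < 0.3)   # sparse signal
A = rng.standard_normal((n, d))
b = A @ x_true + 0.01 * rng.standard_normal(n)

lam = 0.1
L = np.linalg.eigvalsh(A.T @ A)[-1]        # Lipschitz constant of grad f
F = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(d)
for k in range(500):
    x = soft(x - A.T @ (A @ x - b) / L, lam / L)   # prox-gradient step
print("F(x) =", F(x), " nonzeros =", int((x != 0).sum()))
```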

Relevant Problems for Proximal-PL

We also analyze proximal coordinate descent under PL. Reddi et al. [2016] analyze proximal-SVRG and proximal-SAGA.

Proximal-PL is satisfied when:

$f$ is strongly convex;

$f$ satisfies PL and $g$ is constant;

$f = h(Ax)$ for strongly-convex $h$ and $g$ is the indicator of a polyhedral set;

$F$ is convex and satisfies QG;

$F$ satisfies the proximal-EB condition or the KL inequality.

This includes the dual support vector machine (SVM) problem, which implies a linear rate of SDCA for SVMs. It also includes the $\ell_1$-regularized least-squares (LASSO) problem, with no need for RIP, homotopy, modified restricted strong convexity, ...

Summary

In 1963, Polyak proposed a condition for a linear rate of gradient descent. It gives a trivial proof, is weaker than more recent conditions, and is the weakest condition that guarantees global minima.

We can use the inequality to analyze huge-scale methods: coordinate descent, stochastic gradient, SVRG, etc.

We give a proximal-gradient generalization: standard algorithms have a linear rate for SVM and LASSO.

Thank you!