Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization


Shai Shalev-Shwartz and Tong Zhang
School of CS and Engineering, The Hebrew University of Jerusalem
Optimization for Machine Learning, NIPS Workshop, 2012

Regularized Loss Minimization

$$\min_w \; P(w) := \frac{1}{n}\sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2}\|w\|^2 .$$

Regularized Loss Minimization

$$\min_w \; P(w) := \frac{1}{n}\sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2}\|w\|^2 .$$

Examples of $\phi_i(z)$:
- SVM: $\phi_i(z) = \max\{0, 1 - y_i z\}$ (Lipschitz, not smooth)
- Logistic regression: $\phi_i(z) = \log(1 + \exp(-y_i z))$ (Lipschitz and smooth)
- Abs-loss regression: $\phi_i(z) = |z - y_i|$ (Lipschitz, not smooth)
- Square-loss regression: $\phi_i(z) = (z - y_i)^2$ (smooth, not Lipschitz)
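As a concrete illustration (not part of the slides), here is a minimal sketch of evaluating the primal objective $P(w)$ for two of the listed losses; the data layout (rows of `X` as examples) and the function name are assumptions of the example.

```python
import numpy as np

def primal_objective(w, X, y, lam, loss="hinge"):
    """Regularized loss P(w) = (1/n) sum_i phi_i(w^T x_i) + (lam/2) ||w||^2."""
    z = X @ w                      # margins w^T x_i, shape (n,)
    if loss == "hinge":            # SVM: max{0, 1 - y_i z}
        losses = np.maximum(0.0, 1.0 - y * z)
    elif loss == "logistic":       # log(1 + exp(-y_i z))
        losses = np.log1p(np.exp(-y * z))
    else:
        raise ValueError("unknown loss")
    return losses.mean() + 0.5 * lam * np.dot(w, w)

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(rng.standard_normal(100))
w = np.zeros(5)
print(primal_objective(w, X, y, lam=1e-3))   # equals 1.0 at w = 0 for the hinge loss
```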

Dual Coordinate Ascent (DCA)

Primal problem:
$$\min_w \; P(w) := \frac{1}{n}\sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2}\|w\|^2$$

Dual problem:
$$\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) \;-\; \frac{\lambda}{2}\left\| \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i \right\|^2$$

where $\phi_i^*$ is the convex conjugate of $\phi_i$.

DCA: at each iteration, optimize $D(\alpha)$ with respect to a single coordinate, while the rest of the coordinates are kept intact.

Stochastic Dual Coordinate Ascent (SDCA): choose the updated coordinate uniformly at random.
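Not from the slides: a generic sketch of SDCA that literally maximizes $D$ along one randomly chosen coordinate with a numerical 1-D solver, here for the square loss, whose conjugate term is $-\phi_i^*(-a) = a\,y_i - a^2/4$. The function names and the use of scipy's scalar optimizer are illustrative assumptions; for losses like the hinge, the coordinate maximization has a closed form (shown two slides below).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sdca_generic(X, y, lam, epochs=10, seed=0):
    """Generic SDCA: at each step, numerically maximize the dual D along one
    randomly chosen coordinate, for the square loss phi_i(z) = (z - y_i)^2."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                      # maintained as w = (1/(lam*n)) * sum_i alpha_i x_i
    rng = np.random.default_rng(seed)

    def dual_along(i, delta):
        # Value of D(alpha + delta * e_i), up to terms that do not depend on delta.
        a = alpha[i] + delta
        w_new = w + (delta / (lam * n)) * X[i]
        return (a * y[i] - a ** 2 / 4.0) / n - 0.5 * lam * np.dot(w_new, w_new)

    for _ in range(epochs):
        for i in rng.integers(0, n, size=n):          # uniform random coordinate
            res = minimize_scalar(lambda delta: -dual_along(i, delta))
            alpha[i] += res.x
            w += (res.x / (lam * n)) * X[i]
    return w, alpha
```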

SDCA vs. SGD update rule

Stochastic Gradient Descent (SGD) update rule:
$$w^{(t+1)} = \left(1 - \frac{1}{t}\right) w^{(t)} - \frac{1}{\lambda t}\, \phi_i'(w^{(t)\top} x_i)\, x_i$$

SDCA update rule:
1. $\Delta_i = \operatorname*{argmax}_{\Delta \in \mathbb{R}} D(\alpha^{(t)} + \Delta\, e_i)$, and set $\alpha^{(t+1)} = \alpha^{(t)} + \Delta_i e_i$
2. $w^{(t+1)} = w^{(t)} + \frac{1}{\lambda n}\,\Delta_i\, x_i$

The update rules are rather similar. SDCA has several advantages:
- Stopping criterion
- No need to tune a learning rate

SDCA vs. SGD update rule: example

SVM with the hinge loss: $\phi_i(w^\top x_i) = \max\{0, 1 - y_i\, w^\top x_i\}$

SGD update rule:
$$w^{(t+1)} = \left(1 - \frac{1}{t}\right) w^{(t)} + \frac{1}{\lambda t}\,\mathbb{1}[y_i\, x_i^\top w^{(t)} < 1]\; y_i x_i$$

SDCA update rule:
1. $\alpha^{(t+1)} = \alpha^{(t)} + \Delta_i e_i$, where
$$\Delta_i = y_i \max\!\left(0,\; \min\!\left(1,\; \frac{1 - y_i\, x_i^\top w^{(t)}}{\|x_i\|_2^2/(\lambda n)} + y_i \alpha_i^{(t)}\right)\right) - \alpha_i^{(t)}$$
2. $w^{(t+1)} = w^{(t)} + \frac{1}{\lambda n}\,\Delta_i\, x_i$
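A minimal sketch (my own code, not the authors' implementation) of SDCA with this closed-form hinge-loss update; the epoch loop and variable names are assumptions.

```python
import numpy as np

def sdca_hinge(X, y, lam, epochs=20, seed=0):
    """Minimal SDCA for the hinge-loss SVM: maintains dual variables alpha
    and the primal vector w = (1/(lam*n)) * sum_i alpha_i * x_i."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = (X ** 2).sum(axis=1)            # ||x_i||^2, precomputed
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.integers(0, n, size=n):   # coordinate chosen uniformly at random
            margin = y[i] * (X[i] @ w)
            # closed-form maximization of D along coordinate i (hinge loss)
            candidate = (1.0 - margin) / (sq_norms[i] / (lam * n)) + y[i] * alpha[i]
            delta = y[i] * max(0.0, min(1.0, candidate)) - alpha[i]
            alpha[i] += delta
            w += (delta / (lam * n)) * X[i]
    return w, alpha
```

For instance, `w, alpha = sdca_hinge(X, y, lam=1e-4)` on the toy data from the first snippet; the duality gap $P(w) - D(\alpha)$ then provides the stopping criterion mentioned above.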

SDCA vs. SGD: experimental observations

On the CCAT dataset, $\lambda = 10^{-6}$, smoothed loss.

[Figure: log-scale convergence curves ($10^0$ down to $10^{-6}$) vs. number of epochs for SDCA, SDCA-Perm, and SGD.]

SDCA vs. SGD: experimental observations

On the CCAT dataset, $\lambda = 10^{-5}$, hinge loss.

[Figure: log-scale convergence curves ($10^0$ down to $10^{-6}$) vs. number of epochs for SDCA, SDCA-Perm, and SGD.]

SDCA vs. SGD: current analysis is unsatisfactory

How many iterations are required to guarantee $P(w^{(t)}) \le P(w^\star) + \epsilon$?

- For SGD: $\tilde{O}\!\left(\frac{1}{\lambda\epsilon}\right)$
- For SDCA:
  - Hsieh et al. (ICML 2008), following Luo and Tseng (1992): $O\!\left(\frac{1}{\nu}\log\frac{1}{\epsilon}\right)$, but $\nu$ can be arbitrarily small
  - Shalev-Shwartz and Tewari (2009), Nesterov (2010): $O(n/\epsilon)$ for general $n$-dimensional coordinate ascent
    - Can be applied to the dual problem
    - The resulting rate is slower than SGD
    - And the analysis does not hold for logistic regression (it requires a smooth dual)
    - The analysis is for dual sub-optimality

Dual vs. primal sub-optimality

- Take data which is linearly separable using a vector $w_0$
- Set $\lambda = 2\epsilon/\|w_0\|^2$ and use the hinge loss
- Then $P(w^\star) \le P(w_0) = \epsilon$
- $D(0) = 0$, so $D(\alpha^\star) - D(0) = P(w^\star) - D(0) \le \epsilon$
- But $w^{(0)} = 0$, so $P(w^{(0)}) - P(w^\star) = 1 - P(w^\star) \ge 1 - \epsilon$

Conclusion: in the interesting regime, $\epsilon$-sub-optimality on the dual can be meaningless with respect to the primal!

Our results

- For $(1/\gamma)$-smooth loss: $\tilde{O}\!\left(\left(n + \frac{1}{\lambda\gamma}\right)\log\frac{1}{\epsilon}\right)$
- For $L$-Lipschitz loss: $\tilde{O}\!\left(n + \frac{1}{\lambda\epsilon}\right)$
- For almost-smooth loss functions (e.g. the hinge loss): $\tilde{O}\!\left(n + \frac{1}{\lambda\,\epsilon^{1/(1+\nu)}}\right)$, where $\nu > 0$ is a data-dependent quantity

SDCA vs. DCA: randomization is crucial

On the CCAT dataset, $\lambda = 10^{-4}$, smoothed hinge loss.

[Figure: log-scale convergence curves ($10^0$ down to $10^{-6}$) vs. number of epochs for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound.]

In particular, the bound of Luo and Tseng holds for the cyclic order, hence must be inferior to our bound.

Smoothing the hinge-loss

$$\phi(x) = \begin{cases} 0 & x > 1 \\ 1 - x - \gamma/2 & x < 1 - \gamma \\ \dfrac{(1-x)^2}{2\gamma} & \text{otherwise} \end{cases}$$
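A minimal sketch of this piecewise definition in code (the vectorized form and the function name are mine):

```python
import numpy as np

def smoothed_hinge(x, gamma):
    """Smoothed hinge loss: 0 for x > 1, linear (1 - x - gamma/2) for x < 1 - gamma,
    and quadratic (1 - x)^2 / (2*gamma) in between."""
    x = np.asarray(x, dtype=float)
    return np.where(
        x > 1.0,
        0.0,
        np.where(x < 1.0 - gamma, 1.0 - x - gamma / 2.0, (1.0 - x) ** 2 / (2.0 * gamma)),
    )

# gamma -> 0 recovers the hinge loss; gamma > 0 makes the loss (1/gamma)-smooth
print(smoothed_hinge([-1.0, 0.5, 2.0], gamma=0.1))   # [1.95, 0.45, 0.0]
```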

Smoothing the hinge-loss: mild effect on the 0-1 error

[Figure: 0-1 error as a function of the smoothing parameter $\gamma \in [0, 1]$ on three datasets: astro-ph (roughly 0.0354 to 0.036), CCAT (roughly 0.0503 to 0.0512), cov1 (roughly 0.226 to 0.2285).]

Smoothing the hinge-loss: improves training time

[Figure: duality gap as a function of runtime for different smoothing parameters $\gamma \in \{0, 0.01, 0.1, 1\}$ on astro-ph, CCAT, and cov1 (log scale, $10^0$ down to $10^{-6}$).]


Additional related work

- Collins et al. (2008): for smooth loss, a bound similar to ours (for the smooth case), but for a more complicated algorithm (Exponentiated Gradient on the dual)
- Lacoste-Julien, Jaggi, Schmidt, Pletscher (preprint on arXiv): study the Frank-Wolfe algorithm for the dual of structured prediction problems; it boils down to SDCA for the case of the binary hinge loss, with the same bound as ours for the Lipschitz case
- Le Roux, Schmidt, Bach (NIPS 2012): a variant of SGD for smooth losses and a finite sample; also obtains a $\log(1/\epsilon)$ rate

Extensions

- Slightly better rates for SDCA with SGD initialization
- Proximal Stochastic Dual Coordinate Ascent (in preparation; a preliminary version is on arXiv). Solve
$$\min_w \; P(w) := \frac{1}{n}\sum_{i=1}^{n} \phi_i(X_i^\top w) + \lambda\, g(w)$$
  - For example, $g(w) = \frac{1}{2}\|w\|_2^2 + \frac{\sigma}{\lambda}\|w\|_1$ leads to a thresholding operator (see the sketch after this list)
  - For example, $\phi_i : \mathbb{R}^k \to \mathbb{R}$ with $\phi_i(v) = \max_{y'} \big(\delta(y_i, y') + v_{y'} - v_{y_i}\big)$ is useful for structured output prediction
- Approximate dual maximization (closed-form approximate solutions can be obtained while still maintaining the same guarantees on the convergence rate)
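Not from the slides: a sketch of the thresholding operator that the example $g(w) = \frac{1}{2}\|w\|_2^2 + \frac{\sigma}{\lambda}\|w\|_1$ induces, assuming the primal point is recovered by soft-thresholding the dual average $v = \frac{1}{\lambda n}\sum_i \alpha_i x_i$ with threshold $\sigma/\lambda$; the helper name is mine.

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding: argmin_w 0.5*||w - v||^2 + tau*||w||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# With g(w) = 0.5*||w||_2^2 + (sigma/lam)*||w||_1, applying soft_threshold(v, sigma/lam)
# to the dual average v produces sparse primal iterates.
v = np.array([0.3, -0.05, 1.2])
print(soft_threshold(v, tau=0.1))   # -> [0.2, 0.0, 1.1]
```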

Proof idea

Main lemma: for any $t$ and $s \in [0,1]$,
$$\mathbb{E}\big[D(\alpha^{(t)}) - D(\alpha^{(t-1)})\big] \;\ge\; \frac{s}{n}\,\mathbb{E}\big[P(w^{(t-1)}) - D(\alpha^{(t-1)})\big] \;-\; \left(\frac{s}{n}\right)^2 \frac{G^{(t)}}{2\lambda}$$

- $G^{(t)} = O(1)$ for Lipschitz losses
- With an appropriate $s$, $G^{(t)} \le 0$ for smooth losses

Proof idea

Main lemma: for any $t$ and $s \in [0,1]$,
$$\mathbb{E}\big[D(\alpha^{(t)}) - D(\alpha^{(t-1)})\big] \;\ge\; \frac{s}{n}\,\mathbb{E}\big[P(w^{(t-1)}) - D(\alpha^{(t-1)})\big] \;-\; \left(\frac{s}{n}\right)^2 \frac{G^{(t)}}{2\lambda}$$

- Bounding dual sub-optimality: since $P(w^{(t-1)}) \ge D(\alpha^\star)$, the above lemma yields a convergence rate for the dual sub-optimality
- Bounding the duality gap: summing the inequality over iterations $T_0 + 1, \ldots, T$ and choosing a random $t \in \{T_0 + 1, \ldots, T\}$ yields
$$\mathbb{E}\big[P(w^{(t-1)}) - D(\alpha^{(t-1)})\big] \;\le\; \frac{n}{s(T - T_0)}\,\mathbb{E}\big[D(\alpha^{(T)}) - D(\alpha^{(T_0)})\big] + \frac{sG}{2\lambda n}$$
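To spell out the first bullet (my own expansion, under the assumption that $s$ is chosen so that $G^{(t)} \le 0$, as in the smooth case): writing $\epsilon_D^{(t)} := D(\alpha^\star) - D(\alpha^{(t)})$ and using $P(w^{(t-1)}) \ge D(\alpha^\star)$,
$$\mathbb{E}\big[\epsilon_D^{(t)}\big] \;\le\; \mathbb{E}\big[\epsilon_D^{(t-1)}\big] - \frac{s}{n}\,\mathbb{E}\big[P(w^{(t-1)}) - D(\alpha^{(t-1)})\big] \;\le\; \Big(1 - \frac{s}{n}\Big)\,\mathbb{E}\big[\epsilon_D^{(t-1)}\big] \;\le\; \Big(1 - \frac{s}{n}\Big)^{t}\,\epsilon_D^{(0)},$$
so roughly $\frac{n}{s}\log\frac{1}{\epsilon}$ iterations suffice for $\mathbb{E}[\epsilon_D^{(t)}] \le \epsilon$, which is where the $\log(1/\epsilon)$ dependence for smooth losses comes from.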

Summary

- SDCA works very well in practice
- So far, theoretical guarantees were unsatisfactory
- Our analysis shows that SDCA is an excellent choice in many scenarios