Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods
Julie Nutini
Machine Learning Reading Group, November 3rd, 2015

Motivation
Projected-Gradient Methods: rewrite a non-smooth problem as a smooth constrained problem,
$$\min_{x \in C} f(x).$$
Only handles simple constraints, e.g., bound constraints.
Frank-Wolfe Algorithm: minimize a linear function over C.
Proximal-Gradient Methods generalize projected gradient:
$$\min_{x} f(x) + r(x),$$
where f is smooth and r is a general convex function (proximable). Dealing with $r(x) = \phi(Ax)$ is difficult, even when $\phi$ is simple; this leads to the Alternating Direction Method of Multipliers.
TODAY: We focus on coordinate descent, which handles the case where r is separable and f has some special structure.

Coordinate Descent Methods
Suitable for large-scale optimization (dimension d is large):
Certain smooth (unconstrained) problems.
Non-smooth problems with separable constraints/regularizers, e.g., $\ell_1$-regularization, bound constraints.
Faster than gradient descent if iterations are d times cheaper.

Problems Suitable for Coordinate Descent
A coordinate update is d times faster than a gradient update for:
$$h_1(x) = f(Ax) + \sum_{i=1}^{d} g_i(x_i), \qquad \text{or} \qquad h_2(x) = \sum_{i \in V} g_i(x_i) + \sum_{(i,j) \in E} f_{ij}(x_i, x_j),$$
where f and the $f_{ij}$ are smooth and convex, A is a matrix, $(V, E)$ is a graph, and the $g_i$ are general non-degenerate convex functions.
Examples of $h_1$: least squares, logistic regression, lasso, $\ell_2$-norm SVMs, e.g.,
$$\min_{x \in \mathbb{R}^d} \tfrac{1}{2}\|Ax - b\|^2 + \lambda \sum_{i=1}^{d} |x_i|.$$
Examples of $h_2$: quadratics, graph-based label propagation, graphical models, e.g.,
$$\min_{x \in \mathbb{R}^d} \tfrac{1}{2}x^T A x + b^T x = \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} a_{ij} x_i x_j + \sum_{i=1}^{d} b_i x_i.$$
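A minimal sketch (not from the slides; it assumes NumPy, and the name cd_least_squares is illustrative) of randomized coordinate descent on the least-squares instance of $h_1$, illustrating why a coordinate update is roughly d times cheaper than a full gradient step: by maintaining the residual $r = Ax - b$, each update touches only one column of A.

```python
import numpy as np

def cd_least_squares(A, b, iters=5000, seed=0):
    """Randomized coordinate descent for min_x 0.5*||Ax - b||^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    r = A @ x - b                        # residual, maintained incrementally
    L = (A ** 2).sum(axis=0)             # coordinate-wise Lipschitz constants L_i = ||A_i||^2
    for _ in range(iters):
        i = rng.integers(d)              # uniform random coordinate
        g_i = A[:, i] @ r                # i-th partial derivative, O(n) work
        step = g_i / L[i]                # exact minimization along coordinate i
        x[i] -= step
        r -= step * A[:, i]              # keep the residual consistent, O(n) work
    return x

# Usage on a small random instance, compared against the least-squares solution:
rng = np.random.default_rng(1)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
print(np.abs(cd_least_squares(A, b) - np.linalg.lstsq(A, b, rcond=None)[0]).max())
```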

Notation and Assumptions
We focus on the convex optimization problem
$$\min_{x \in \mathbb{R}^d} f(x)$$
with $\nabla f$ coordinate-wise L-Lipschitz continuous, i.e.,
$$|\nabla_i f(x + \alpha e_i) - \nabla_i f(x)| \le L|\alpha|,$$
and f $\mu$-strongly convex, i.e., $x \mapsto f(x) - \tfrac{\mu}{2}\|x\|^2$ is convex for some $\mu > 0$.
If f is twice-differentiable, these are equivalent to $\nabla^2_{ii} f(x) \le L$ and $\nabla^2 f(x) \succeq \mu I$.

Coordinate Descent vs. Gradient Descent
$$x^{k+1} = x^k - \tfrac{1}{L}\nabla_{i_k} f(x^k)\, e_{i_k} \qquad\text{vs.}\qquad x^{k+1} = x^k - \alpha \nabla f(x^k)$$
Global convergence rate for randomized $i_k$ selection [Nesterov]:
$$\mathbb{E}[f(x^{k+1})] - f(x^\star) \le \Big(1 - \frac{\mu}{Ld}\Big)\big[f(x^k) - f(x^\star)\big]$$
Global convergence rate for gradient descent:
$$f(x^{k+1}) - f(x^\star) \le \Big(1 - \frac{\mu}{L_f}\Big)\big[f(x^k) - f(x^\star)\big]$$
Since $Ld \ge L_f \ge L$, coordinate descent is slower per iteration, but d coordinate iterations are faster than one gradient iteration.
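A rough, first-order comparison (an added reasoning step, not spelled out on the slide): applying the randomized-coordinate factor d times, which costs roughly the same as one gradient iteration, gives
$$\Big(1 - \frac{\mu}{Ld}\Big)^{d} \approx 1 - \frac{\mu}{L} \;\le\; 1 - \frac{\mu}{L_f} \qquad (\text{since } L \le L_f),$$
so per unit of computation, randomized coordinate descent is expected to make at least comparable progress to gradient descent.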

Proximal Coordinate Descent
$$\min_{x \in \mathbb{R}^d} F(x) \equiv f(x) + \sum_i g_i(x_i)$$
where f is smooth and the $g_i$ may be non-smooth, e.g., $\ell_1$-regularization, bound constraints.
Apply a proximal-gradient style update,
$$x^{k+1} = \mathrm{prox}_{\frac{1}{L} g_{i_k}}\Big[x^k - \tfrac{1}{L}\nabla_{i_k} f(x^k)\, e_{i_k}\Big], \qquad \mathrm{prox}_{\alpha g}[y] = \operatorname*{argmin}_{x} \tfrac{1}{2}\|x - y\|^2 + \alpha g(x).$$
Convergence for randomized $i_k$:
$$\mathbb{E}[F(x^{k+1})] - F(x^\star) \le \Big(1 - \frac{\mu}{dL}\Big)\big[F(x^k) - F(x^\star)\big]$$
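A minimal sketch (assumes NumPy; prox_cd_lasso and soft_threshold are illustrative names) of proximal coordinate descent for the lasso example above, where the prox of the separable term $\lambda |x_i|$ is scalar soft-thresholding.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t*|.| applied to the scalar z (soft-thresholding)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def prox_cd_lasso(A, b, lam, iters=5000, seed=0):
    """Randomized proximal CD for min_x 0.5*||Ax - b||^2 + lam*||x||_1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    r = A @ x - b                        # residual of the smooth part
    L = (A ** 2).sum(axis=0)             # coordinate-wise Lipschitz constants
    for _ in range(iters):
        i = rng.integers(d)
        g_i = A[:, i] @ r                # partial derivative of the smooth part
        x_new = soft_threshold(x[i] - g_i / L[i], lam / L[i])   # proximal coordinate step
        r += (x_new - x[i]) * A[:, i]    # keep the residual consistent
        x[i] = x_new
    return x
```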

Sampling Rules
Cyclic: cycle through i in order, i.e., $i_1 = 1$, $i_2 = 2$, etc.
Uniform random: sample $i_k$ uniformly from $\{1, 2, \ldots, d\}$.
Lipschitz sampling: sample $i_k$ proportional to $L_i$.
Gauss-Southwell: select $i_k = \operatorname*{argmax}_i |\nabla_i f(x^k)|$.
Gauss-Southwell-Lipschitz: select $i_k = \operatorname*{argmax}_i |\nabla_i f(x^k)| / \sqrt{L_i}$.
[Embedded slides: "Cyclic Coordinate Descent" and "Stochastic Coordinate Descent" from Wright (UW-Madison), Coordinate Descent Methods, November 2014; the stochastic variant chooses components i(j) randomly and independently at each iteration.]
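A hedged sketch (assumes NumPy; choose_index is an illustrative helper, not from the slides) of the selection rules above, written over the current gradient and the per-coordinate constants $L_i$, using 0-based indices.

```python
import numpy as np

def choose_index(rule, k, grad, L, rng):
    """Return the coordinate index for iteration k under the given sampling rule."""
    d = grad.size
    if rule == "cyclic":                      # i_1 = 1, i_2 = 2, ... (0-based: wrap around)
        return k % d
    if rule == "uniform":                     # sample uniformly from {1, ..., d}
        return int(rng.integers(d))
    if rule == "lipschitz":                   # sample i with probability L_i / sum_j L_j
        return int(rng.choice(d, p=L / L.sum()))
    if rule == "gauss-southwell":             # argmax_i |grad_i f(x^k)|
        return int(np.argmax(np.abs(grad)))
    if rule == "gauss-southwell-lipschitz":   # argmax_i |grad_i f(x^k)| / sqrt(L_i)
        return int(np.argmax(np.abs(grad) / np.sqrt(L)))
    raise ValueError(f"unknown rule: {rule}")
```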

Gauss-Southwell Rules
GS: $\operatorname*{argmax}_i |\nabla_i f(x^k)|$. GSL: $\operatorname*{argmax}_i |\nabla_i f(x^k)| / \sqrt{L_i}$.
Intuition: if the gradients are similar, more progress is made when $L_i$ is small.
[Figure: Gauss-Southwell vs. Gauss-Southwell-Lipschitz update directions in the $(x_1, x_2)$ plane.]
Feasible for problems where A is very sparse, or for a graph whose mean number of neighbours is approximately equal to the maximum number of neighbours.
GS and GSL are shown to be up to d times faster than randomized selection, by measuring strong convexity in the 1-norm or L-norm, respectively.

Exact Optimization
$$x^{k+1} = x^k - \alpha_k \nabla_{i_k} f(x^k)\, e_{i_k}, \quad \text{for some } i_k$$
Exact coordinate optimization chooses the step size that minimizes f:
$$f(x^{k+1}) = \min_{\alpha}\, f\big(x^k - \alpha \nabla_{i_k} f(x^k)\, e_{i_k}\big)$$
Alternatives:
Line search: find $\alpha > 0$ such that $f(x^k - \alpha \nabla_{i_k} f(x^k)\, e_{i_k}) < f(x^k)$.
Select the step size based on global knowledge of f, e.g., $1/L$.
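As a worked example (not on the slide; it assumes the quadratic example $h_2$ from earlier with A symmetric), the exact step has a closed form. Writing $g = \nabla_i f(x^k) = (Ax^k + b)_i$ for $f(x) = \tfrac{1}{2}x^T A x + b^T x$,
$$\alpha_k^\star = \operatorname*{argmin}_{\alpha}\, f\big(x^k - \alpha\, g\, e_i\big) = \frac{1}{A_{ii}}, \qquad\text{so}\qquad x_i^{k+1} = x_i^k - \frac{(Ax^k + b)_i}{A_{ii}},$$
i.e., exact coordinate minimization coincides with the $1/L_i$ step here, since the coordinate-wise Lipschitz constant of this quadratic is $L_i = A_{ii}$.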

Stochastic Dual Coordinate Ascent
Suitable for large-scale supervised learning (large number of loss functions n):
The primal is formulated as a sum of convex loss functions; the method operates on the dual.
Achieves a faster, linear rate than SGD for smooth loss functions.
Theoretically equivalent to the stochastic sub-gradient method (SSG) for non-smooth loss functions.

The Big Picture...
Stochastic Gradient Descent (SGD):
Strong theoretical guarantees.
Hard to tune the step size (requires $\alpha_k \to 0$).
No clear stopping criterion (same for the stochastic sub-gradient method, SSG).
Converges fast at first, then slowly to a more accurate solution.
Stochastic Dual Coordinate Ascent (SDCA):
Strong theoretical guarantees, comparable to SGD.
Easy to tune the step size (line search).
Terminate when the duality gap is sufficiently small.
Converges to an accurate solution faster than SGD.

Primal Problem
$$\text{(P)} \qquad \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(w^T x_i) + \frac{\lambda}{2}\|w\|^2$$
where $x_1, \ldots, x_n$ are vectors in $\mathbb{R}^d$, $\phi_1, \ldots, \phi_n$ is a sequence of scalar convex functions, and $\lambda > 0$ is a regularization parameter.
Examples (for given labels $y_1, \ldots, y_n \in \{-1, 1\}$):
SVMs: $\phi_i(a) = \max\{0, 1 - y_i a\}$ (L-Lipschitz).
Regularized logistic regression: $\phi_i(a) = \log(1 + \exp(-y_i a))$.
Ridge regression: $\phi_i(a) = (a - y_i)^2$ (smooth).
Regression: $\phi_i(a) = |a - y_i|$.
Support vector regression: $\phi_i(a) = \max\{0, |a - y_i| - \nu\}$.

Dual Problem
$$\text{(P)} \qquad \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(w^T x_i) + \frac{\lambda}{2}\|w\|^2$$
$$\text{(D)} \qquad \max_{\alpha \in \mathbb{R}^n} D(\alpha) = \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i\Big\|^2$$
where $\phi_i^*(u) = \max_z (zu - \phi_i(z))$ is the convex conjugate of $\phi_i$.
A different dual variable is associated with each example in the training set.
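As a worked example (not spelled out on the slide), the conjugate of the hinge loss used for SVMs can be computed directly: for $\phi_i(a) = \max\{0, 1 - y_i a\}$ with $y_i \in \{-1, +1\}$,
$$\phi_i^*(u) = \sup_a \big(ua - \max\{0, 1 - y_i a\}\big) = \begin{cases} u y_i & \text{if } -1 \le u y_i \le 0, \\ +\infty & \text{otherwise,} \end{cases}$$
so the term $-\phi_i^*(-\alpha_i)$ in $D(\alpha)$ equals $\alpha_i y_i$ under the box constraint $0 \le \alpha_i y_i \le 1$, recovering the usual SVM dual.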

Duality Gap
$$\text{(P)} \qquad \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(w^T x_i) + \frac{\lambda}{2}\|w\|^2$$
$$\text{(D)} \qquad \max_{\alpha \in \mathbb{R}^n} D(\alpha) = \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i\Big\|^2$$
Define $w(\alpha) = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i$; then it is known that $w(\alpha^\star) = w^\star$ and $P(w^\star) = D(\alpha^\star)$, which implies $P(w) \ge D(\alpha)$ for all $w, \alpha$.
The duality gap is defined by $P(w(\alpha)) - D(\alpha)$:
an upper bound on the primal sub-optimality $P(w(\alpha)) - P(w^\star)$.

SDCA Algorithm
(1) Select a training example i at random.
(2) Do an exact line search in the dual, i.e., find $\Delta\alpha_i$ maximizing
$$-\phi_i^*\big(-(\alpha_i^{(t-1)} + \Delta\alpha_i)\big) - \frac{\lambda n}{2}\big\|w^{(t-1)} + (\lambda n)^{-1}\Delta\alpha_i x_i\big\|^2.$$
(3) Update the dual variable $\alpha^{(t)}$ and the primal variable $w^{(t)}$:
$$\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \Delta\alpha_i e_i, \qquad w^{(t)} \leftarrow w^{(t-1)} + (\lambda n)^{-1}\Delta\alpha_i x_i.$$
Terminate when the duality gap is sufficiently small.
There are ways to get the rate without a line search, using the primal gradient/subgradient directions.
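A hedged sketch (assumes NumPy; sdca_svm is an illustrative name, and the closed-form $\Delta\alpha_i$ for the hinge loss follows the standard SDCA derivation rather than anything spelled out on the slide) of SDCA for the SVM primal, terminating when the duality gap $P(w(\alpha)) - D(\alpha)$ is small.

```python
import numpy as np

def sdca_svm(X, y, lam, gap_tol=1e-4, max_epochs=100, seed=0):
    """SDCA for P(w) = (1/n) sum_i max(0, 1 - y_i w^T x_i) + (lam/2)||w||^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                            # maintained as w(alpha) = (1/(lam*n)) sum_i alpha_i x_i
    sqnorms = (X ** 2).sum(axis=1)
    for _ in range(max_epochs):
        for i in rng.permutation(n):
            # Closed-form maximizer of the one-variable dual subproblem for the hinge loss.
            margin = 1.0 - y[i] * (X[i] @ w)
            delta = y[i] * np.clip(margin * lam * n / sqnorms[i] + alpha[i] * y[i], 0.0, 1.0) - alpha[i]
            alpha[i] += delta
            w += delta * X[i] / (lam * n)
        primal = np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)
        dual = np.mean(alpha * y) - 0.5 * lam * (w @ w)
        if primal - dual <= gap_tol:           # duality gap certifies primal sub-optimality
            break
    return w, alpha
```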

SGD vs. SDCA
An alternative to SGD/SSG.
If the primal is smooth, SDCA gets a faster, linear rate on the duality gap than SGD.
If the primal is non-smooth, it gets a sublinear rate on the duality gap.
SDCA has a similar update to SSG on the primal; SSG is sensitive to the step size, whereas SDCA does a line search in the dual with coordinate ascent.
SDCA may not perform as well as SGD for the first few epochs (full passes): SGD takes larger steps than SDCA early on, which helps performance.
Using modified SGD on the first epoch followed by SDCA obtains faster convergence when the regularization parameter $\lambda \gg \log(n)/n$.

Comparison of Rates
Lipschitz loss functions (e.g., hinge loss, $\phi_i(a) = \max\{0, 1 - y_i a\}$):
  SGD: primal, $\tilde{O}(1/(\lambda \epsilon_p))$
  online EG (Collins et al., 2008) (for SVM): dual, $\tilde{O}(n/\epsilon_d)$
  Stochastic Frank-Wolfe (Lacoste-Julien et al., 2012): primal-dual, $\tilde{O}(n + 1/(\lambda\epsilon))$
  SDCA: primal-dual, $\tilde{O}(n + 1/(\lambda\epsilon))$ or faster
Smooth loss functions (e.g., ridge regression, $\phi_i(a) = (a - y_i)^2$):
  SGD: primal, $\tilde{O}(1/(\lambda \epsilon_p))$
  online EG (Collins et al., 2008) (for LR): dual, $\tilde{O}((n + 1/\lambda)\log(1/\epsilon_d))$
  SAG (Le Roux et al., 2012) (assuming $n \ge 8/(\lambda\gamma)$): primal, $\tilde{O}((n + 1/\lambda)\log(1/\epsilon_p))$
  SDCA: primal-dual, $\tilde{O}((n + 1/\lambda)\log(1/\epsilon))$
Even if $\alpha$ is $\epsilon_d$-sub-optimal in the dual, i.e., $D(\alpha) \ge D(\alpha^\star) - \epsilon_d$, the primal solution $w(\alpha)$ might be far from optimal.
The bound on the duality gap is an upper bound on the primal sub-optimality.
Recent results have shown improvements upon some of the rates in the tables above.

Accelerated Coordinate Descent
Inspired by Nesterov's accelerated gradient method.
Uses a multi-step strategy that carries momentum from previous iterations.
For accelerated randomized coordinate descent on a convex function: an $O(1/k^2)$ rate instead of $O(1/k)$.

Block Coordinate Descent
$$x^{k+1} = x^k - \tfrac{1}{L}\nabla_{b_k} f(x^k)\, e_{b_k}, \quad \text{for some block of indices } b_k$$
Search along a coordinate hyperplane; blocks may be fixed or adaptive.
Randomized/proximal coordinate descent extends easily to the block case.
In the proximal case, the choice of block must be consistent with the block-separable structure of the regularization function g.
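A hedged sketch (assumes NumPy; block_cd_least_squares and the fixed equal-size blocks are illustrative choices, not from the slides) of randomized block coordinate descent on least squares, using the per-block Lipschitz constant $\|A_b\|_2^2$.

```python
import numpy as np

def block_cd_least_squares(A, b, block_size=5, iters=2000, seed=0):
    """Randomized block CD for min_x 0.5*||Ax - b||^2 with fixed blocks (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    blocks = [np.arange(s, min(s + block_size, d)) for s in range(0, d, block_size)]
    L = [np.linalg.norm(A[:, blk], 2) ** 2 for blk in blocks]   # per-block Lipschitz constants
    x = np.zeros(d)
    r = A @ x - b                               # residual, maintained incrementally
    for _ in range(iters):
        j = rng.integers(len(blocks))           # uniformly random block
        blk = blocks[j]
        g = A[:, blk].T @ r                     # block of the gradient
        step = g / L[j]
        x[blk] -= step
        r -= A[:, blk] @ step                   # keep the residual consistent
    return x
```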

Parallel Coordinate Descent
Synchronous parallelism: divide the iterate updates between processors (by block), followed by a synchronization step.
Asynchronous parallelism: each processor has access to x, chooses an index i, loads the components of x needed to compute the gradient component $\nabla_i f(x)$, then updates the i-th component $x_i$; there is no attempt to coordinate or synchronize with other processors.
The updates always use a stale x: convergence results restrict how stale it can be.

Discussion
Coordinate Descent:
Suitable for large-scale optimization (when d is large); operates on the primal objective.
Faster than gradient descent if iterations are d times cheaper.
Stochastic Dual Coordinate Ascent:
Suitable for large-scale optimization (when n is large); operates on the dual objective.
If the primal is smooth, obtains a faster, linear rate on the duality gap than SGD.
If the primal is non-smooth, obtains a sublinear rate on the duality gap.
Does a line search in the dual with coordinate ascent.
Outperforms SGD when relatively high solution accuracy is required; terminate when the duality gap is sufficiently small.
Variations: acceleration, block, parallel.