Coordinate Descent and Ascent Methods
Julie Nutini
Machine Learning Reading Group, November 3rd, 2015
1 / 22
Motivation

Projected-Gradient Methods:
- Rewrite a non-smooth problem as a smooth constrained problem: min_{x ∈ C} f(x).
- Only handles simple constraints, e.g., bound constraints.
- Frank-Wolfe algorithm: minimize a linear function over C.

Proximal-Gradient Methods:
- Generalize projected-gradient: min_x f(x) + r(x), where f is smooth and r is a general convex (proximable) function.
- Dealing with r(x) = φ(Ax) is difficult, even when φ is simple.

Alternating Direction Method of Multipliers.

TODAY: We focus on coordinate descent, which is for the case where r is separable and f has some special structure.
2 / 22
Coordinate Descent Methods

Suitable for large-scale optimization (dimension d is large):
- Certain smooth (unconstrained) problems.
- Non-smooth problems with separable constraints/regularizers, e.g., l1-regularization, bound constraints.

Faster than gradient descent if iterations are d times cheaper.
3 / 22
Problems Suitable for Coordinate Descent

Coordinate updates are d times faster than gradient updates for:

    h1(x) = f(Ax) + Σ_{i=1}^d g_i(x_i),   or   h2(x) = Σ_{i ∈ V} g_i(x_i) + Σ_{(i,j) ∈ E} f_ij(x_i, x_j),

where f and f_ij are smooth and convex, A is a matrix, G = (V, E) is a graph, and the g_i are general non-degenerate convex functions.

Examples of h1: least squares, logistic regression, lasso, l2-norm SVMs.
    e.g., min_{x ∈ R^d} (1/2)‖Ax − b‖² + λ Σ_{i=1}^d |x_i|.

Examples of h2: quadratics, graph-based label propagation, graphical models.
    e.g., min_{x ∈ R^d} (1/2) xᵀAx + bᵀx = (1/2) Σ_{i=1}^d Σ_{j=1}^d a_ij x_i x_j + Σ_{i=1}^d b_i x_i.
4 / 22
Notation and Assumptions

We focus on the convex optimization problem

    min_{x ∈ R^d} f(x)

- ∇f coordinate-wise L-Lipschitz continuous: |∇_i f(x + αe_i) − ∇_i f(x)| ≤ L|α|.
- f µ-strongly convex: x ↦ f(x) − (µ/2)‖x‖² is convex for some µ > 0.

If f is twice-differentiable, these are equivalent to ∇²_ii f(x) ≤ L and ∇²f(x) ⪰ µI.
5 / 22
Coordinate Descent vs. Gradient Descent

    x^{k+1} = x^k − (1/L) ∇_{i_k} f(x^k) e_{i_k}        x^{k+1} = x^k − α ∇f(x^k)

Global convergence rate for randomized i_k selection [Nesterov]:

    E[f(x^{k+1})] − f(x*) ≤ (1 − µ/(Ld)) [f(x^k) − f(x*)]

Global convergence rate for gradient descent:

    f(x^{k+1}) − f(x*) ≤ (1 − µ/L_f) [f(x^k) − f(x*)]

Since L ≤ L_f ≤ Ld, coordinate descent is slower per iteration, but d coordinate iterations are faster than one gradient iteration.
6 / 22
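As a minimal sketch of the comparison above (the quadratic below is an assumed example, not from the slides), we can run d randomized coordinate steps for every one gradient step on f(x) = (1/2)xᵀAx − bᵀx, where ∇_i f(x) = A[i,:]x − b_i, the coordinate-wise constant is L = max_i A_ii, and L_f is the largest eigenvalue of A:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)          # positive definite => f is strongly convex
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)   # minimizer of f

def f(x):
    return 0.5 * x @ A @ x - b @ x

L = np.max(np.diag(A))           # coordinate-wise Lipschitz constant (max A_ii)
L_f = np.linalg.eigvalsh(A)[-1]  # full-gradient Lipschitz constant

x_cd = np.zeros(d)
x_gd = np.zeros(d)
for k in range(2000 * d):        # d coordinate steps ~ cost of 1 gradient step
    i = rng.integers(d)
    x_cd[i] -= (A[i] @ x_cd - b[i]) / L   # x^{k+1} = x^k - (1/L) grad_i f e_i
for k in range(2000):
    x_gd -= (A @ x_gd - b) / L_f          # x^{k+1} = x^k - (1/L_f) grad f

print(f(x_cd) - f(x_star), f(x_gd) - f(x_star))  # both near 0
```

Per iteration, the coordinate step costs O(d) (one row dot product) while the gradient step costs O(d²), which is the "d times cheaper" trade-off the rates describe.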
Proximal Coordinate Descent

    min_{x ∈ R^d} F(x) ≡ f(x) + Σ_i g_i(x_i)

where f is smooth and the g_i might be non-smooth, e.g., l1-regularization, bound constraints.

Apply a proximal-gradient style update,

    x^{k+1} = prox_{(1/L) g_{i_k}} [ x^k − (1/L) ∇_{i_k} f(x^k) e_{i_k} ],

where prox_{αg}[y] = argmin_{x ∈ R^d} (1/2)‖x − y‖² + αg(x).

Convergence rate for randomized i_k:

    E[F(x^{k+1})] − F(x*) ≤ (1 − µ/(dL)) [F(x^k) − F(x*)]
7 / 22
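A minimal sketch of this update for the lasso (an assumed example, not from the slides): with F(x) = (1/2)‖Ax − b‖² + λ‖x‖₁, each g_i(x_i) = λ|x_i| and the prox step reduces to soft-thresholding, with per-coordinate constants L_i = ‖A_:i‖²:

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*|.| : shrink z toward zero by t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_cd_lasso(A, b, lam, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    r = A @ x - b                      # residual Ax - b, kept up to date
    L = (A ** 2).sum(axis=0)           # coordinate Lipschitz constants ||A_:i||^2
    for _ in range(iters):
        i = rng.integers(d)
        g = A[:, i] @ r                # grad_i f(x) = A_:i' (Ax - b)
        x_new = soft_threshold(x[i] - g / L[i], lam / L[i])  # prox step
        r += A[:, i] * (x_new - x[i])  # O(n) residual update
        x[i] = x_new
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
x = prox_cd_lasso(A, b, lam=5.0)
print(x)  # with a large enough lam, many coordinates are exactly zero
```

Maintaining the residual r makes each iteration O(n) rather than O(nd), which is what makes the coordinate update d times cheaper than a full proximal-gradient step.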
Sampling Rules

- Cyclic: cycle through i in order, i.e., i_1 = 1, i_2 = 2, etc.
- Uniform random: sample i_k uniformly from {1, 2, ..., d}.
- Lipschitz sampling: sample i_k proportional to L_i.
- Gauss-Southwell: select i_k = argmax_i |∇_i f(x^k)|.
- Gauss-Southwell-Lipschitz: select i_k = argmax_i |∇_i f(x^k)| / √L_i.
8 / 22
Gauss-Southwell Rules

    GS: i_k = argmax_i |∇_i f(x^k)|        GSL: i_k = argmax_i |∇_i f(x^k)| / √L_i

Intuition: if the gradient values are similar, more progress is made on coordinates where L_i is small.

[Figure: steps of the Gauss-Southwell-Lipschitz and Gauss-Southwell rules in (x1, x2).]

Feasible for problems where A is very sparse, or for a graph where the mean number of neighbours is approximately equal to the maximum number of neighbours.

GS and GSL are shown to be up to d times faster than randomized selection, by measuring strong convexity in the 1-norm or L-norm, respectively.
9 / 22
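A minimal sketch of GS vs. uniform-random selection on an assumed quadratic example (here GS is computed from the full gradient for simplicity; efficient GS implementations exploit sparsity):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 30
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)
f = lambda x: 0.5 * x @ A @ x - b @ x
L = np.max(np.diag(A))

def run(rule, iters=300):
    x = np.zeros(d)
    for _ in range(iters):
        g = A @ x - b                 # full gradient (for the demo only)
        i = np.argmax(np.abs(g)) if rule == "GS" else rng.integers(d)
        x[i] -= g[i] / L              # same 1/L step for both rules
    return f(x) - f(x_star)

gap_gs, gap_rand = run("GS"), run("random")
print(gap_gs, gap_rand)  # GS typically reaches a smaller gap per iteration
```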
Exact Optimization

    x^{k+1} = x^k − α_k ∇_{i_k} f(x^k) e_{i_k},   for some i_k

Exact coordinate optimization chooses the step size minimizing f:

    f(x^{k+1}) = min_α { f(x^k − α ∇_{i_k} f(x^k) e_{i_k}) }

Alternatives:
- Line search: find α > 0 such that f(x^k − α ∇_{i_k} f(x^k) e_{i_k}) < f(x^k).
- Select the step size based on global knowledge of f, e.g., 1/L.
10 / 22
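For a quadratic f(x) = (1/2)xᵀAx − bᵀx, the exact minimization over α has a closed form: setting the derivative of f(x^k − α∇_i f(x^k)e_i) to zero gives α_k = 1/A_ii, i.e., a coordinate-wise Newton step. A minimal sketch on an assumed example:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 15
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)
b = rng.standard_normal(d)
f = lambda x: 0.5 * x @ A @ x - b @ x

x = np.zeros(d)
for k in range(10000):
    i = rng.integers(d)
    g_i = A[i] @ x - b[i]     # grad_i f(x)
    x[i] -= g_i / A[i, i]     # exact minimizer along coordinate i

x_star = np.linalg.solve(A, b)
print(f(x) - f(x_star))  # near 0
```

Since A_ii ≤ L, the exact step is never shorter than the conservative 1/L step, so it decreases f at least as much per iteration.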
Stochastic Dual Coordinate Ascent

Suitable for large-scale supervised learning (number of loss functions n is large):
- Primal formulated as a sum of convex loss functions.
- Operates on the dual.
- Achieves a faster linear rate than SGD for smooth loss functions.
- Theoretically equivalent to SSG for non-smooth loss functions.
11 / 22
The Big Picture...

Stochastic Gradient Descent (SGD):
- Strong theoretical guarantees.
- Hard to tune step size (requires α_k → 0).
- No clear stopping criterion (same for the Stochastic Sub-Gradient method (SSG)).
- Converges fast at first, then slowly to a more accurate solution.

Stochastic Dual Coordinate Ascent (SDCA):
- Strong theoretical guarantees, comparable to SGD.
- Easy to tune step size (line search).
- Terminate when the duality gap is sufficiently small.
- Converges to an accurate solution faster than SGD.
12 / 22
Primal Problem

    (P)   min_{w ∈ R^d} P(w) = (1/n) Σ_{i=1}^n φ_i(wᵀx_i) + (λ/2)‖w‖²

where x_1, ..., x_n are vectors in R^d, φ_1, ..., φ_n is a sequence of scalar convex functions, and λ > 0 is a regularization parameter.

Examples (for given labels y_1, ..., y_n ∈ {−1, 1}):
- SVMs: φ_i(a) = max{0, 1 − y_i a}  (L-Lipschitz)
- Regularized logistic regression: φ_i(a) = log(1 + exp(−y_i a))
- Ridge regression: φ_i(a) = (a − y_i)²  (smooth)
- Regression: φ_i(a) = |a − y_i|
- Support vector regression: φ_i(a) = max{0, |a − y_i| − ν}
13 / 22
Dual Problem

    (P)   min_{w ∈ R^d} P(w) = (1/n) Σ_{i=1}^n φ_i(wᵀx_i) + (λ/2)‖w‖²

    (D)   max_{α ∈ R^n} D(α) = (1/n) Σ_{i=1}^n −φ*_i(−α_i) − (λ/2) ‖ (1/(λn)) Σ_{i=1}^n α_i x_i ‖²

where φ*_i(u) = max_z (zu − φ_i(z)) is the convex conjugate of φ_i.

A different dual variable is associated with each example in the training set.
14 / 22
Duality Gap

    (P)   min_{w ∈ R^d} P(w) = (1/n) Σ_{i=1}^n φ_i(wᵀx_i) + (λ/2)‖w‖²

    (D)   max_{α ∈ R^n} D(α) = (1/n) Σ_{i=1}^n −φ*_i(−α_i) − (λ/2) ‖ (1/(λn)) Σ_{i=1}^n α_i x_i ‖²

Define w(α) = (1/(λn)) Σ_{i=1}^n α_i x_i; then it is known that w(α*) = w*.

P(w*) = D(α*), which implies P(w) ≥ D(α) for all w, α.

The duality gap is defined by P(w(α)) − D(α):
- Upper bound on the primal sub-optimality: P(w(α)) − P(w*).
15 / 22
SDCA Algorithm

(1) Select a training example i at random.
(2) Do an exact line search in the dual, i.e., find the Δα_i maximizing

        −φ*_i(−(α_i^{(t−1)} + Δα_i)) − (λn/2) ‖ w^{(t−1)} + (λn)^{−1} Δα_i x_i ‖²

(3) Update the dual variable α^{(t)} and the primal variable w^{(t)}:

        α^{(t)} ← α^{(t−1)} + Δα_i e_i
        w^{(t)} ← w^{(t−1)} + (λn)^{−1} Δα_i x_i

Terminate when the duality gap is sufficiently small.

There are ways to get the rate without a line search that use the primal gradient/subgradient directions.
16 / 22
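A minimal sketch of SDCA for ridge regression (an assumed example, not from the slides). With the squared loss φ_i(a) = (a − y_i)², the conjugate is φ*_i(u) = y_i u + u²/4, and the single-variable dual maximization in step (2) has the closed-form solution used below:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 100, 5, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def primal(w):
    return np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w

def dual(alpha):
    # D(alpha) = (1/n) sum_i (y_i a_i - a_i^2/4) - (lam/2)||w(alpha)||^2
    w = X.T @ alpha / (lam * n)
    return np.mean(y * alpha - alpha ** 2 / 4) - 0.5 * lam * w @ w

alpha = np.zeros(n)
w = np.zeros(d)                      # maintained so that w = w(alpha)
for t in range(50 * n):
    i = rng.integers(n)
    # closed-form maximizer of the dual over the single variable alpha_i
    delta = (y[i] - X[i] @ w - 0.5 * alpha[i]) / (0.5 + X[i] @ X[i] / (lam * n))
    alpha[i] += delta                # step (3): dual update
    w += delta * X[i] / (lam * n)    # step (3): matching primal update

gap = primal(w) - dual(alpha)
print(gap)  # duality gap; terminate once it is sufficiently small
```

The duality gap is computable from quantities the algorithm already maintains, which is what gives SDCA its clean stopping criterion.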
SGD vs. SDCA

Alternative to SGD/SSG:
- If the primal is smooth, get a faster linear rate on the duality gap than SGD.
- If the primal is non-smooth, get a sublinear rate on the duality gap.

SDCA has a similar update to SSG on the primal:
- SSG is sensitive to the step size.
- SDCA does a line search in the dual with coordinate ascent.

SDCA may not perform as well as SGD for the first few epochs (full passes):
- SGD takes larger steps than SDCA early on, which helps performance.
- Using a modified SGD on the first epoch followed by SDCA obtains faster convergence when the regularization parameter λ ≫ log(n)/n.
17 / 22
Comparison of Rates

Lipschitz loss functions (e.g., hinge loss, φ_i(a) = max{0, 1 − y_i a}):

    Algorithm                                              Convergence type   Rate
    SGD                                                    primal             Õ(1/(λε_p))
    online EG (Collins et al., 2008) (for SVMs)            dual               Õ(n/ε_d)
    Stochastic Frank-Wolfe (Lacoste-Julien et al., 2012)   primal-dual        Õ(n + 1/(λε))
    SDCA                                                   primal-dual        Õ(n + 1/(λε)) or faster

Smooth loss functions (e.g., ridge regression, φ_i(a) = (a − y_i)²):

    Algorithm                                              Convergence type   Rate
    SGD                                                    primal             Õ(1/(λε_p))
    online EG (Collins et al., 2008) (for LR)              dual               Õ((n + 1/λ) log(1/ε_d))
    SAG (Le Roux et al., 2012) (assuming n ≥ 8/(λγ))       primal             Õ((n + 1/λ) log(1/ε_p))
    SDCA                                                   primal-dual        Õ((n + 1/λ) log(1/ε))

Even if α is ε_d-sub-optimal in the dual, i.e., D(α) ≥ D(α*) − ε_d, the primal solution w(α) might be far from optimal.

The bound on the duality gap is an upper bound on the primal sub-optimality.

Recent results have shown improvements upon some of the rates in the tables above.
18 / 22
Accelerated Coordinate Descent

- Inspired by Nesterov's accelerated gradient method.
- Uses a multi-step strategy, carrying momentum from previous iterations.
- Accelerated randomized coordinate descent achieves, e.g., for a convex function, an O(1/k²) rate instead of O(1/k).
19 / 22
Block Coordinate Descent

    x^{k+1} = x^k − (1/L) ∇_{b_k} f(x^k) e_{b_k},   for some block of indices b_k

- Search along a coordinate hyperplane.
- Fixed blocks or adaptive blocks.
- Randomized/proximal CD is easily extended to the block case.
- For the proximal case, the choice of blocks must be consistent with the block-separable structure of the regularization function g.
20 / 22
Parallel Coordinate Descent

Synchronous parallelism:
- Divide iterate updates between processors (by blocks), followed by a synchronization step.

Asynchronous parallelism:
- Each processor has access to x, chooses an index i, loads the components of x needed to compute the gradient component ∇_i f(x), then updates the i-th component x_i.
- No attempt to coordinate or synchronize with other processors.
- Processors are always using a stale x: convergence results restrict how stale it can be.
21 / 22
Discussion

Coordinate Descent:
- Suitable for large-scale optimization (when d is large).
- Operates on the primal objective.
- Faster than gradient descent if iterations are d times cheaper.

Stochastic Dual Coordinate Ascent:
- Suitable for large-scale optimization (when n is large).
- Operates on the dual objective.
- If the primal is smooth, obtains a faster linear rate on the duality gap than SGD.
- If the primal is non-smooth, obtains a sublinear rate on the duality gap.
- Does a line search in the dual with coordinate ascent.
- Outperforms SGD when relatively high solution accuracy is required.
- Terminate when the duality gap is sufficiently small.

Variations: acceleration, block updates, parallelism.
22 / 22