Parallel Coordinate Optimization


Parallel Coordinate Optimization
Julie Nutini, MLRG - Spring Term, March 6th, 2018

Coordinate Descent in 2D
Contours of a function F : R² → R. Goal: find the minimizer of F.
[Figure: successive coordinate descent iterates x^k, x^{k+2}, x^{k+3}, ..., each step moving along a single coordinate direction, converging to the minimizer x*.]

Coordinate Descent
Update a single coordinate i at each iteration:
x_i^{k+1} = x_i^k − α_k ∇_i f(x^k)
Easy to implement, low memory requirements, cheap iteration costs. Suitable for large-scale optimization (dimension n is large): certain smooth (unconstrained) problems, and non-smooth problems with separable constraints/regularizers (e.g., ℓ1-regularization, bound constraints). Faster than gradient descent if iterations are n times cheaper. Adaptable to distributed settings. For truly huge-scale problems, it is absolutely necessary to parallelize.
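
To make the update concrete, here is a minimal sketch (mine, not from the slides) of randomized coordinate descent on the quadratic loss f(x) = ½‖Ax − b‖², where the step 1/L_i with L_i = ‖A_{:,i}‖² is also the exact minimizer along coordinate i:

```python
import numpy as np

def coordinate_descent(A, b, iters=10_000, seed=0):
    """Randomized coordinate descent for f(x) = 0.5 * ||Ax - b||^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                 # maintain the residual r = Ax - b
    L = (A ** 2).sum(axis=0)      # coordinate-wise Lipschitz constants L_i = ||A_{:,i}||^2
    for _ in range(iters):
        i = rng.integers(n)       # pick a coordinate uniformly at random
        g_i = A[:, i] @ r         # gradient component grad_i f(x) = A_{:,i}^T (Ax - b)
        step = g_i / L[i]
        x[i] -= step              # x_i <- x_i - (1/L_i) grad_i f(x)
        r -= step * A[:, i]       # cheap residual update instead of recomputing Ax - b
    return x
```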

Problem
Consider the optimization problem
min_{x ∈ R^n} F(x) := f(x) + g(x),
where f is a loss function, convex (smooth or nonsmooth), and g is a regularizer, convex (smooth or nonsmooth) and separable.

Regularizer Examples
g(x) = Σ_{i=1}^n g_i(x_i),  x = (x_1, x_2, ..., x_n)^T
No regularizer: g_i(x_i) ≡ 0.
Weighted ℓ1-norm: g_i(x_i) = λ_i |x_i| (λ_i > 0), e.g., LASSO.
Weighted ℓ2-norm: g_i(x_i) = λ_i (x_i)² (λ_i > 0).
Box constraints: g_i(x_i) = 0 if x_i ∈ X_i, +∞ otherwise, e.g., SVM dual.
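
Separability of g is what keeps the coordinate subproblems one-dimensional: the proximal step for g splits into n independent scalar problems. A small illustration (mine, not from the slides) of the coordinate-wise proximal maps for the examples above:

```python
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding: prox of g_i(x) = lam * |x|."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_weighted_l2(v, lam):
    """Prox of g_i(x) = lam * x^2: simple shrinkage v / (1 + 2*lam)."""
    return v / (1.0 + 2.0 * lam)

def prox_box(v, lo, hi):
    """Prox of the indicator of [lo, hi]: projection onto the box."""
    return np.clip(v, lo, hi)

print(prox_l1(np.array([-2.0, 0.3, 1.5]), lam=0.5))        # [-1.5, 0., 1.]
print(prox_weighted_l2(2.0, lam=0.5), prox_box(2.0, 0.0, 1.0))
```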

Loss Examples
Quadratic loss: f(x) = ½‖Ax − y‖²₂ = ½ Σ_{j=1}^m (A_{j:}x − y_j)²  [Bradley et al., 2011]
Logistic loss: f(x) = Σ_{j=1}^m log(1 + exp(−y_j A_{j:}x))  [Richtárik & Takáč, 2011b, 2013a]
Square hinge loss: f(x) = ½ (max{0, 1 − y_j A_{j:}x})²  [Takáč et al., 2013]
L-infinity: f(x) = ‖Ax − y‖_∞ = max_{1≤j≤m} |A_{j:}x − y_j|
L1-regression: f(x) = ‖Ax − y‖₁ = Σ_{j=1}^m |A_{j:}x − y_j|  [Fercoq & Richtárik, 2013]
Exponential loss: f(x) = log((1/m) Σ_{j=1}^m exp(y_j A_{j:}x))

Parallel Coordinate Descent
Embarrassingly parallel if the objective is separable: speedup equal to the number of processors, τ. For partially-separable objectives: assign the i-th processor the task of updating the i-th component of x; each processor communicates its x_i^+ to the processors that require it. The i-th processor needs the current value of x_j only if ∇_i f or ∇²_{ii} f depends on x_j. Parallel implementations are suitable when the dependency graph is sparse.

Dependency Graph
Given a fixed serial ordering of updates, those in red (updates that do not use each other's results) can be done in parallel. (i, j) is an arc of the dependency graph iff the update function h_j depends on x_i.
Update order {1, 2, 3, 4}:
x_1^{k+1} = h_1(x_1^k, x_3^k)
x_2^{k+1} = h_2(x_1^{k+1}, x_2^k)
x_3^{k+1} = h_3(x_2^{k+1}, x_3^k, x_4^k)
x_4^{k+1} = h_4(x_2^{k+1}, x_4^k)
Better update order {1, 3, 4, 2}:
x_1^{k+1} = h_1(x_1^k, x_3^k)
x_3^{k+1} = h_3(x_2^k, x_3^k, x_4^k)
x_4^{k+1} = h_4(x_2^k, x_4^k)
x_2^{k+1} = h_2(x_1^{k+1}, x_2^k)
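
One simple way (my own illustration, not from the slides) to see how much parallelism a given update order exposes is to sweep through the order and start a new parallel group whenever an update reads a coordinate already written in the current group; reads of coordinates written later in the same group use the old values, matching the serial semantics above:

```python
def parallel_groups(order, reads):
    """Greedily split a serial update order into groups that can run in parallel.

    reads[j] = set of coordinates that the update function h_j reads.
    """
    groups, current, written = [], [], set()
    for j in order:
        if reads[j] & written:         # j needs a value produced in the current group
            groups.append(current)     # so the current group must finish first
            current, written = [], set()
        current.append(j)
        written.add(j)
    groups.append(current)
    return groups

# Dependencies from the slide: h_1 reads {1,3}, h_2 reads {1,2}, h_3 reads {2,3,4}, h_4 reads {2,4}.
reads = {1: {1, 3}, 2: {1, 2}, 3: {2, 3, 4}, 4: {2, 4}}
print(parallel_groups([1, 2, 3, 4], reads))  # [[1], [2], [3, 4]]  -> three sequential stages
print(parallel_groups([1, 3, 4, 2], reads))  # [[1, 3, 4], [2]]    -> two stages, three updates at once
```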

Parallel Coordinate Descent
Synchronous parallelism: divide the iterate updates between processors, followed by a synchronization step. Very slow for large-scale problems (wait for the slowest processor).
Asynchronous parallelism: each processor has access to x, chooses an index i, loads the components of x that are needed to compute the gradient component ∇_i f(x), then updates the i-th component x_i. No attempt to coordinate or synchronize with other processors. The algorithm is always using a stale x, and convergence results restrict how stale it can be. Many numerical results actually use an asynchronous implementation and ignore the synchronization step required by the theory.

Totally Asynchronous Algorithm
Definition: An algorithm is totally asynchronous if (1) each index i ∈ {1, 2, ..., n} of x is updated at infinitely many iterations, and (2) if ν_j^k denotes the iteration at which component j of the vector x̂^k was last updated, then ν_j^k → ∞ as k → ∞ for all j = 1, 2, ..., n. There is no condition on how stale x̂_j is; it is only required that it will be updated eventually.
Theorem (Bertsekas and Tsitsiklis, 1989): Suppose the mapping T(x) := x − α∇f(x) for some α > 0 satisfies
‖T(x) − x*‖_∞ ≤ η ‖x − x*‖_∞ for some η ∈ (0, 1).
Then if we set α_k ≡ α in Algorithm 7, the sequence {x^k} converges to x*.
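
A quick numerical illustration (mine, assuming a strongly convex quadratic f(x) = ½xᵀQx − bᵀx): here T(x) − x* = (I − αQ)(x − x*), so the ℓ∞ contraction factor is η = ‖I − αQ‖_∞, which can be checked directly before iterating x ← T(x).

```python
import numpy as np

# Diagonally dominant quadratic: f(x) = 0.5 x^T Q x - b^T x, minimizer x* solves Qx = b.
Q = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.5],
              [0.5, 0.5, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(Q, b)

alpha = 0.2
eta = np.abs(np.eye(3) - alpha * Q).sum(axis=1).max()   # ||I - alpha*Q||_inf = 0.8 < 1
print("contraction factor:", eta)

x = np.zeros(3)
for _ in range(200):
    x = x - alpha * (Q @ x - b)                         # x <- T(x)
print("max error:", np.max(np.abs(x - x_star)))         # essentially zero
```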

Partly Asynchronous Algorithm
There is no convergence rate for the totally asynchronous setting, given the weak assumptions on x̂^k, and the ℓ∞ contraction assumption on the mapping T is quite strong. Liu et al. (2015) assume that no component of x̂^k is older than a nonnegative integer τ (the maximum delay) at any k. The maximum delay τ is related to the number of processors (an indicator of the potential parallelism): if all processors complete their updates at approximately the same rate, τ is a small constant multiple of the number of processors. They obtain linear convergence if essential strong convexity holds, and sublinear convergence for general convex functions. Near-linear speedup if the number of processors is O(n^{1/2}) in unconstrained optimization, or O(n^{1/4}) in the separable-constrained case.
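
As a toy illustration only (mine; this is not the algorithm analyzed by Liu et al.), one can simulate the bounded-delay model by letting each update read a copy of x that is at most tau_delay iterations old:

```python
import numpy as np

def bounded_delay_cd(A, b, tau_delay=10, iters=20_000, seed=0):
    """Coordinate descent where each update reads an iterate up to tau_delay steps stale."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = (A ** 2).sum(axis=0)
    x = np.zeros(n)
    history = [x.copy()]                      # recent iterates (bounded-delay model)
    for _ in range(iters):
        delay = rng.integers(min(tau_delay, len(history)))
        x_stale = history[-1 - delay]         # stale read
        i = rng.integers(n)
        g_i = A[:, i] @ (A @ x_stale - b)     # gradient component from stale information
        x[i] -= g_i / (2.0 * L[i])            # damped step to tolerate the staleness
        history.append(x.copy())
        if len(history) > tau_delay:
            history.pop(0)
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
x_hat = bounded_delay_cd(A, b)
print(np.linalg.norm(A.T @ (A @ x_hat - b)))  # gradient norm, small if the run converged
```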

Question
Under what structural assumptions does parallelization lead to acceleration?

Convergence of Randomized Coordinate Descent
In R^n, randomized coordinate descent with uniform selection requires O(n · ξ(ε)) iterations, where
ξ(ε) = log(1/ε) for strongly convex F,
ξ(ε) = 1/ε for smooth or simple nonsmooth F,
ξ(ε) = 1/ε² for difficult nonsmooth F.
When dealing with big data, the factor n is the one we care about.

The Parallelization Dream
Serial (1 coordinate per iteration): O(n ξ(ε)) iterations. Parallel (τ coordinates per iteration): O((n/τ) ξ(ε)) iterations.
What do we actually get? O((nβ/τ) ξ(ε)) iterations; we want β = O(1). The achievable β depends on the extent to which we can add up the individual updates, i.e., on properties of F and on the selection of coordinates at each iteration.

Naive Parallelization
Consider the function f(x_1, x_2) = (x_1 + x_2 − 1)². Just compute the update for more/all coordinates and then add up the updates?
[Figure: starting from x^k = (0, 0) with f(0, 0) = 1, each coordinate-wise update reaches the line f(x_1, x_2) = 0, but adding the two updates lands at (1, 1) with f(1, 1) = 1, so the combined step overshoots and makes no progress. NO GOOD!]

Naive Parallelization
Consider the function f(x_1, x_2) = (x_1 + x_2 − 1)². What about averaging the updates?
[Figure: starting from x^k, averaging the two coordinate-wise updates lands exactly on a minimizer x*. SOLVED!]

Naive Parallelization
Consider the function f(x_1, x_2) = (x_1 − 1)² + (x_2 − 1)². What about averaging the updates? COULD BE TOO CONSERVATIVE...
[Figure: starting from x^k, each averaged step x^{k+1}, x^{k+2}, ... covers only half of the remaining distance to the minimizer, and so on.]
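
A tiny numerical check (mine, not from the slides) of the two combining rules on the two examples above, using exact coordinate-wise minimization from x = (0, 0) (which for these quadratics equals the step −∇_i f(x)/L_i):

```python
import numpy as np

def coord_updates(grad, L, x):
    """Per-coordinate steps h_i = -grad_i f(x) / L_i, each computed independently (in parallel)."""
    return -grad(x) / L

# Non-separable: f(x1, x2) = (x1 + x2 - 1)^2, grad f = 2(x1 + x2 - 1) * [1, 1], L_i = 2.
f1 = lambda x: (x[0] + x[1] - 1.0) ** 2
g1 = lambda x: 2.0 * (x[0] + x[1] - 1.0) * np.ones(2)
h = coord_updates(g1, np.array([2.0, 2.0]), np.zeros(2))
print(f1(h), f1(h / 2))   # summing: f(1,1) = 1, no progress; averaging: f(0.5,0.5) = 0, solved

# Separable: f(x1, x2) = (x1 - 1)^2 + (x2 - 1)^2, grad f = 2(x - 1), L_i = 2.
f2 = lambda x: np.sum((x - 1.0) ** 2)
g2 = lambda x: 2.0 * (x - 1.0)
h = coord_updates(g2, np.array([2.0, 2.0]), np.zeros(2))
print(f2(h), f2(h / 2))   # summing: solved in one step; averaging: f(0.5,0.5) = 0.5, too conservative
```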

Averaging may be too conservative...
Consider the function f(x) = (x_1 − 1)² + (x_2 − 1)² + ... + (x_n − 1)². Evaluate at x^0 = 0: f(x^0) = n. We want f(x^k) = n(1 − 1/n)^{2k} ≤ ε. With averaging, this requires
k ≥ (n/2) log(n/ε).
The factor of n is bad! We wanted O((nβ/τ) ξ(ε)).
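
A quick check of this rate (mine): run the averaged parallel update on the separable quadratic and compare the observed objective with n(1 − 1/n)^{2k}.

```python
import numpy as np

n, iters = 100, 500
x = np.zeros(n)
for _ in range(iters):
    h = 1.0 - x                    # exact coordinate-wise minimizers, computed in parallel
    x = x + h / n                  # averaging: beta = n
f_obs = np.sum((x - 1.0) ** 2)
f_pred = n * (1.0 - 1.0 / n) ** (2 * iters)
print(f_obs, f_pred)               # the two values agree

eps = 1e-3
print(0.5 * n * np.log(n / eps))   # iterations needed: (n/2) log(n/eps), about 576 for n = 100
```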

What to do?
We can write the combined coordinate descent update as
x⁺ ← x + (1/β) Σ_{i=1}^n h_i e_i,
where h_i is the update to coordinate i and e_i is the i-th unit coordinate vector. Averaging: β = n. Summation: β = 1. When can we safely use β ≈ 1?

When can we use small β?
Three models for f with small β:
1. Smooth partially separable f [Richtárik & Takáč, 2011b]: f(x + te_i) ≤ f(x) + ∇f(x)ᵀ(te_i) + (L_i/2)t², with f(x) = Σ_{J∈𝒥} f_J(x), where each f_J depends on x_i for i ∈ J only; ω := max_{J∈𝒥} |J|.
2. Nonsmooth max-type f [Fercoq & Richtárik, 2013]: f(x) = max_{z∈Q} {zᵀAx − g(z)}; ω := max_{1≤j≤m} |{i : A_{ji} ≠ 0}|.
3. f with bounded Hessian [Bradley et al., 2011; Richtárik & Takáč, 2013a]: f(x + h) ≤ f(x) + ∇f(x)ᵀh + ½hᵀAᵀAh, with L = diag(AᵀA); σ := λ_max(L^{−1/2}AᵀAL^{−1/2}).
Here ω is the degree of partial separability and σ is a spectral radius.

Parallel Coordinate Descent Method
At iteration k, select a random set S_k. S_k is a realization of a random set-valued mapping (or sampling) Ŝ. The update h_i depends on F, on x, and on the law describing Ŝ. The method continuously interpolates between serial coordinate descent and gradient descent, by varying E[|Ŝ|] between 1 and n.

ESO: Expected Separable Overapproximation
We say that f admits a (β, w)-ESO with respect to a (uniform) sampling Ŝ, written (f, Ŝ) ~ ESO(β, w), if for all x, h ∈ R^n
E[f(x + h_[Ŝ])] ≤ f(x) + (E[|Ŝ|]/n) (∇f(x)ᵀh + (β/2)‖h‖²_w),
where h_[Ŝ] = Σ_{i∈Ŝ} h_i e_i and ‖h‖²_w := Σ_{i=1}^n w_i (h_i)².
We note that ∇f(x)ᵀh + (β/2)‖h‖²_w is separable in h. Minimizing it with respect to h in parallel yields the update
x_i⁺ ← x_i − ∇_i f(x)/(β w_i),
with the updates computed for i ∈ Ŝ only. This is a separable quadratic overapproximation of E[f] evaluated at the update.
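
Here is a minimal sketch (mine, under simplifying assumptions) of this update for the quadratic loss f(x) = ½‖Ax − b‖² with τ-nice sampling, using the weights w_i = L_i = ‖A_{:,i}‖²; the step-size parameter β is taken as an input and would be supplied by the ESO theory on the following slides.

```python
import numpy as np

def pcdm_step(x, grad, S_hat, beta, w):
    """For each sampled coordinate i: x_i <- x_i - grad_i f(x) / (beta * w_i)."""
    x = x.copy()
    x[S_hat] -= grad[S_hat] / (beta * w[S_hat])
    return x

def pcdm(A, b, tau, beta, iters=1000, seed=0):
    """Parallel coordinate descent for f(x) = 0.5 * ||Ax - b||^2 with tau-nice sampling."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    w = (A ** 2).sum(axis=0)                             # w_i = L_i = ||A_{:,i}||^2
    x = np.zeros(n)
    for _ in range(iters):
        S_hat = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
        grad = A.T @ (A @ x - b)                         # in practice, only grad[S_hat] is needed
        x = pcdm_step(x, grad, S_hat, beta, w)
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)
x = pcdm(A, b, tau=8, beta=8.0)   # dense A: omega = n, so the ESO theorem later gives beta = tau
print(0.5 * np.linalg.norm(A @ x - b) ** 2)
```

Each sampled coordinate would be handled by its own processor; the vectorized assignment over S_hat stands in for that parallel work.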

Convergence Rate for Convex f
If (f, Ŝ) ~ ESO(β, w), then [Richtárik & Takáč, 2011b]
k ≥ (βn / E[|Ŝ|]) (2R²_w(x⁰, x*) / ε) log((F(x⁰) − F*) / (ερ))
implies that P(F(x^k) − F* ≤ ε) ≥ 1 − ρ.
ε: error tolerance; n: number of coordinates; E[|Ŝ|]: average number of coordinates updated per iteration; β: step-size parameter.

Convergence Rate for Strongly Convex f
If (f, Ŝ) ~ ESO(β, w), then [Richtárik & Takáč, 2011b]
k ≥ (n / E[|Ŝ|]) ((β + μ_g(w)) / (μ_f(w) + μ_g(w))) log((F(x⁰) − F*) / (ερ))
implies that P(F(x^k) − F* ≤ ε) ≥ 1 − ρ.
μ_f(w): strong convexity constant of the loss f; μ_g(w): strong convexity constant of the regularizer g. If μ_g(w) is large, then the slowdown effect of β is eliminated.

What if the problem is only partially separable?
Uniform sampling: P(Ŝ = {i}) = 1/n.
τ-nice sampling (for shared-memory systems): P(Ŝ = S) = 1/(n choose τ) if |S| = τ, and 0 otherwise.
At each iteration: choose a set of τ coordinates, with each subset of size τ chosen with the same probability; assign each i to a dedicated processor; compute and apply the update. All sampled sets have the same size τ (any other set has probability 0).

What if the problem is only partially separable?
Doubly uniform (DU) sampling: P(Ŝ = S) = q_{|S|} / (n choose |S|). Generates all sets of equal cardinality with equal probability. Can model unreliable processors/machines. Let q_τ = P(|Ŝ| = τ).
[Figure: example distributions (q_1, q_2, q_3, q_4, q_5) over the cardinality |Ŝ| for n = 5 coordinates.]

What if the problem is only partially separable?
Binomial sampling: consider independent, equally unreliable processors. Each of τ processors is available with probability p_b and busy with probability 1 − p_b. The number of available processors (the number of blocks that can be updated in parallel) at each iteration is then a binomial random variable with parameters τ and p_b. Use explicit or implicit selection.
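
For concreteness, a small sketch (mine) of these samplings as NumPy routines; the binomial variant first draws the number of available processors and then a uniformly random set of that size, which makes it a doubly uniform sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_sampling(n):
    """Serial sampling: a single coordinate, uniformly at random."""
    return rng.integers(n, size=1)

def tau_nice_sampling(n, tau):
    """Every subset of size tau has the same probability 1 / C(n, tau)."""
    return rng.choice(n, size=tau, replace=False)

def binomial_sampling(n, tau, p_b):
    """Draw the number of available processors ~ Binomial(tau, p_b), then a set of that size."""
    tau_avail = rng.binomial(tau, p_b)
    return rng.choice(n, size=tau_avail, replace=False)

print(uniform_sampling(10), tau_nice_sampling(10, 4), binomial_sampling(10, 4, 0.75))
```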

ESO Theory
1. Smooth partially separable f [Richtárik & Takáč, 2011b]: f(x + te_i) ≤ f(x) + ∇f(x)ᵀ(te_i) + (L_i/2)t², with f(x) = Σ_{J∈𝒥} f_J(x), where each f_J depends on x_i for i ∈ J only, and ω := max_{J∈𝒥} |J|.
Theorem: If Ŝ is doubly uniform, then
E[f(x + h_[Ŝ])] ≤ f(x) + (E[|Ŝ|]/n) (∇f(x)ᵀh + (β/2)‖h‖²_w),
where β = 1 + (ω − 1)(E[|Ŝ|²]/E[|Ŝ|] − 1)/(n − 1) and w_i = L_i, i = 1, 2, ..., n.
β is small if ω is small (i.e., f is more separable).
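
For illustration (mine, not from the slides): for the quadratic loss f(x) = ½‖Ax − b‖², the term f_J associated with row j of A depends only on the coordinates in that row's nonzero pattern, so ω is the maximum number of nonzeros in a row; and for τ-nice sampling E[|Ŝ|²]/E[|Ŝ|] = τ, so the formula reduces to β = 1 + (ω − 1)(τ − 1)/(n − 1).

```python
import numpy as np
import scipy.sparse as sp

def eso_beta_tau_nice(A, tau):
    """beta = 1 + (omega - 1)(tau - 1)/(n - 1), with omega = max nonzeros per row of A."""
    A = sp.csr_matrix(A)
    n = A.shape[1]
    omega = int(np.diff(A.indptr).max())      # nonzeros per row, take the maximum
    return 1.0 + (omega - 1) * (tau - 1) / (n - 1), omega

A = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
beta, omega = eso_beta_tau_nice(A, tau=16)
print(omega, beta)                            # sparse A -> small omega -> beta stays close to 1
```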

ESO Theory
(Richtárik & Takáč, 2013) Parallel Coordinate Descent Methods for Big Data Optimization.

Theoretical Speedup
s(r) = τ / (1 + r(τ − 1)), with r = (ω − 1)/(n − 1).
ω is often a small constant relative to n. r is a measure of density; MUCH OF BIG DATA IS HERE (the small-r regime)!
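
A quick look at what this formula predicts (my own numbers): for sparse data (small r) the speedup is nearly linear in τ, while for dense data it saturates at roughly 1/r.

```python
def speedup(tau, omega, n):
    """Theoretical speedup s = tau / (1 + r*(tau - 1)), r = (omega - 1)/(n - 1)."""
    r = (omega - 1) / (n - 1)
    return tau / (1 + r * (tau - 1))

n = 10**6
for omega in (10, 1000, 100000):
    print(omega, [round(speedup(tau, omega, n), 1) for tau in (8, 64, 512)])
# omega = 10     -> [8.0, 64.0, 509.7]   near-linear speedup
# omega = 1000   -> [7.9, 60.2, 339.0]
# omega = 100000 -> [4.7, 8.8, 9.8]      saturating toward 1/r = 10
```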

Theory vs Practice
[Figure: theoretical (left) and experimental (right) speedup versus the number of processors τ, for n = 1000 coordinates.]

Experiment: a 1 billion-by-2 billion LASSO problem [Richtárik & Takáč, 2012]
f(x) = ½‖Ax − b‖²₂, g(x) = ‖x‖₁.
A has 2 × 10⁹ rows and n = 10⁹ columns; ‖x*‖₀ = 10⁵; ‖A_{:,i}‖₀ = 20 for every column; max_j ‖A_{j,:}‖₀ = 35 over the rows, so ω = 35 (the degree of partial separability of f). Used an approximation of τ-nice sampling Ŝ (independent sampling, τ ≪ n). Asynchronous implementation: older information is used to update coordinates, but the observed slowdown is limited.

Experiment: Coordinate Updates
For each τ, serial and parallel CD need approximately the same number of coordinate updates. The method identifies the active set.

Experiment: Iterations
Doubling τ roughly translates to halving the number of iterations.

Experiment: Wall Time
Doubling τ roughly translates to halving the wall time.

Citation / Algorithm / Paper
(Bradley et al., 2011) Shotgun: Parallel coordinate descent for L1-regularized loss minimization. ICML, 2011 (arXiv).
(Richtárik & Takáč, 2011) SCD: Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 27-32, 2012 (Opt. Online 08/2011).
(Richtárik & Takáč, 2012) PCDM: Parallel coordinate descent methods for big data optimization. Mathematical Programming, 2015 (arXiv).
(Fercoq & Richtárik, 2013) SPCDM: Smooth minimization of nonsmooth functions with parallel coordinate descent method (arXiv).
(Richtárik & Takáč, 2013a) HYDRA: Distributed coordinate descent method for learning with big data (arXiv).
(Liu et al., 2013) AsySCD: An asynchronous parallel stochastic coordinate descent algorithm. ICML, 2014 (arXiv).
(Fercoq & Richtárik, 2013) APPROX: Accelerated, parallel and proximal coordinate descent (arXiv).
(Yang, 2013) DisDCA: Trading computation for communication: distributed stochastic dual coordinate ascent. NIPS, 2013.
(Bian et al., 2013) PCDN: Parallel coordinate descent Newton method for efficient l1-regularized minimization (arXiv).
(Liu & Wright, 2014) AsySPCD: Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM J. Optim., 25(1), 2015 (arXiv).
(Mahajan et al., 2014) DBCD: A distributed block coordinate descent method for training l1-regularized linear classifiers. arXiv, 2014.

Citation / Algorithm / Paper (continued)
(Fercoq et al., 2014) Hydra2: Fast distributed coordinate descent for non-strongly convex losses. MLSP, 2014 (arXiv).
(Mareček, Richtárik & Takáč, 2014) DBCD: Distributed block coordinate descent for minimizing partially separable functions. Numerical Analysis and Optimization, Springer Proceedings in Mathematics and Statistics (arXiv).
(Jaggi, Smith, Takáč et al., 2014) CoCoA: Communication-efficient distributed dual coordinate ascent. NIPS, 2014 (arXiv).
(Qu, Richtárik & Zhang, 2014) QUARTZ: Randomized dual coordinate ascent with arbitrary sampling. arXiv, 2014.
(Ma, Smith, Jaggi et al., 2015) CoCoA+: Adding vs. averaging in distributed primal-dual optimization. ICML, 2015.
(Tappenden, Takáč & Richtárik, 2015) PCDM: On the complexity of parallel coordinate descent. arXiv, 2015.
(Hsieh, Yu & Dhillon, 2015) PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent. ICML, 2015.
(Peng et al., 2016) ARock: An algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput., 2016.
(You et al., 2016) Asy-GCD: Asynchronous parallel greedy coordinate descent. NIPS, 2016.

Conclusions
Coordinate descent scales very well to big data problems of special structure; it requires (partial) separability / a sparse dependency graph. Care is needed when combining updates (add them up? average them?). Sampling strategies can take unreliable processors into account.
