Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization

Martin Takáč
The University of Edinburgh

Based on:
- P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. ERGO Tech. Report 11-011, 2011.
- P. Richtárik and M. Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. ERGO Tech. Report 11-012, 2011.

October 06, 2011
Outline
1. Randomized Coordinate Descent Method
   - Theory
   - GPU implementation
   - Numerical examples
2. Facing the Multicore-Challenge II (a few insights from a conference)
3. Discussion
   - StarSs
   - Intel Cilk Plus
   - SIMD and auto-vectorization
Randomized Coordinate Descent Method
Problem Formulation

    min_{x ∈ R^N} F(x) := f(x) + Ψ(x),    (P)

where
- f is convex and smooth,
- Ψ is convex, simple and block-separable,
- N is huge (N ≥ 10^6).

Ψ(x)                          | meaning
0                             | smooth minimization
λ‖x‖₁                         | sparsity-inducing term
indicator function of a set   | block-separable constraints

General Iterative Algorithm
1. choose an initial point x := x₀ ∈ R^N
2. iterate:
   - choose a direction d := d(x)
   - choose a step length α(d, x)
   - set x := x + αd

A more sophisticated direction means fewer iterations, but also more work per iteration.
Why Coordinate Descent?
- a very old optimization method (B. O'Donnell, 1967)
- can move only in the directions {e₁, e₂, ..., e_N}, where e_i = (0, 0, ..., 0, 1, 0, ..., 0)ᵀ

Methods for choosing the direction:
- cyclic (or essentially cyclic)
- the coordinate with the maximal directional derivative (greedy)
- RANDOM (WHY?)

Applications:
- Sparse regression, LASSO
- Truss topology design
- (Sparse) group LASSO
- Elastic Net
- ...
- L1-regularized linear support vector machines (SVM) with various (smooth) loss functions
Randomized Coordinate Descent: a 2D Example

    f(x) = ½ xᵀ A x − bᵀ x,   A = (1, 0.5; 0.5, 1),   b = (1.5, 1.5)ᵀ,   x₀ = (0, 0)ᵀ

What do we need? To generate a random coordinate (an n-sided coin). Luckily, in our case n = 2.

[A sequence of slides showed the iterates moving along coordinate directions toward the minimizer of f.]
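The iteration pictured above can be sketched in a few lines of C++ (our illustration, not code from the talk). For a quadratic f(x) = ½ xᵀAx − bᵀx, minimizing exactly along coordinate i means solving (Ax)_i − b_i = 0 for x_i, i.e. x_i = (b_i − Σ_{j≠i} A_ij x_j) / A_ii:

```cpp
#include <array>
#include <random>

// Randomized coordinate descent on the 2D quadratic above.
// Along coordinate i the 1D restriction is minimized exactly:
// x_i = (b_i - A_ij * x_j) / A_ii. Converges to the minimizer (1, 1).
std::array<double, 2> rcd2d(int iters, unsigned seed = 1) {
    const double A[2][2] = {{1.0, 0.5}, {0.5, 1.0}};
    const double b[2] = {1.5, 1.5};
    std::array<double, 2> x = {0.0, 0.0};     // x0 = (0, 0)^T
    std::mt19937 gen(seed);                   // the "2-sided coin"
    for (int k = 0; k < iters; ++k) {
        int i = static_cast<int>(gen() % 2);
        int j = 1 - i;
        x[i] = (b[i] - A[i][j] * x[j]) / A[i][i];  // exact line minimization
    }
    return x;
}
```

Each update that touches the "other" coordinate halves the error (here x_i = 1.5 − 0.5 x_j, so the error contracts by 0.5 per alternation), which produces the zig-zag path toward (1, 1) seen on the slides.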
Block Structure of R^N

Any vector x ∈ R^N can be written uniquely as

    x = Σ_{i=1}^n U_i x^(i),   where x^(i) = U_iᵀ x ∈ R_i ≡ R^{N_i}.
Block Norms and Induced Norms

Block norms (norms in R_i): We equip R_i with a pair of conjugate Euclidean norms:

    ‖t‖_(i) = ⟨B_i t, t⟩^{1/2},   ‖t‖*_(i) = ⟨B_i^{−1} t, t⟩^{1/2},   t ∈ R_i,

where B_i ∈ R^{N_i × N_i} is a positive definite matrix.

Induced norms (norms in R^N): For fixed positive scalars w₁, ..., w_n and x ∈ R^N we define

    ‖x‖_W = [ Σ_{i=1}^n w_i ‖x^(i)‖²_(i) ]^{1/2},   where W = Diag(w₁, ..., w_n).

The conjugate norm is defined by

    ‖y‖*_W = max_{‖x‖_W ≤ 1} ⟨y, x⟩ = [ Σ_{i=1}^n w_i^{−1} (‖y^(i)‖*_(i))² ]^{1/2}.
Assumptions on f and Ψ

The gradient of f is block coordinate-wise Lipschitz with positive constants L₁, ..., L_n:

    ‖∇_i f(x + U_i t) − ∇_i f(x)‖*_(i) ≤ L_i ‖t‖_(i),   x ∈ R^N, t ∈ R_i, i = 1, ..., n,   (1)

where ∇_i f(x) := (∇f(x))^(i) = U_iᵀ ∇f(x) ∈ R_i.

An important consequence of (1) is the following inequality:

    f(x + U_i t) ≤ f(x) + ⟨∇_i f(x), t⟩ + (L_i/2) ‖t‖²_(i).   (2)

We denote L = Diag(L₁, ..., L_n).

Ψ is block-separable:

    Ψ(x) = Σ_{i=1}^n Ψ_i(x^(i)).
UCDC: Uniform Coordinate Descent for Composite Functions - Algorithm

An upper bound on F(x + U_i t) is readily available from (2):

    F(x + U_i t) = f(x + U_i t) + Ψ(x + U_i t)
                 ≤ f(x) + ⟨∇_i f(x), t⟩ + (L_i/2) ‖t‖²_(i) + Ψ_i(x^(i) + t)  [ =: V_i(x, t) ]
                   + Σ_{j≠i} Ψ_j(x^(j)).

Algorithm 1: UCDC(x₀)
  for k = 0, 1, 2, ...
    choose block coordinate i ∈ {1, 2, ..., n} with probability 1/n
    T^(i)(x_k) = arg min { V_i(x_k, t) : t ∈ R_i ≡ R^{N_i} }
    x_{k+1} = x_k + U_i T^(i)(x_k)
  end for

- In each step we minimize a function of N_i variables.
- We assume Ψ is simple: T^(i)(x_k) can be computed in closed form.
- If Ψ ≡ 0, then T^(i)(x_k) = −(1/L_i) B_i^{−1} ∇_i f(x_k).
The Main Result

Theorem 1. Choose an initial point x₀ and confidence level ρ ∈ (0, 1). Further, let the target accuracy ε > 0 and iteration counter k be chosen in any of the following two ways:

(i) ε < F(x₀) − F* and

    k ≥ (2n max{R²_L(x₀), F(x₀) − F*} / ε)(1 + log(1/ρ)) + 2 − 2n max{R²_L(x₀), F(x₀) − F*} / (F(x₀) − F*),

(ii) ε < min{R²_L(x₀), F(x₀) − F*} and

    k ≥ (2n R²_L(x₀) / ε) log((F(x₀) − F*) / (ερ)).

If x_k is the random point generated by UCDC(x₀) as applied to F, then

    P(F(x_k) − F* ≤ ε) ≥ 1 − ρ.

Here R²_W(x₀) = max_x { max_{x* ∈ X*} ‖x − x*‖²_W : F(x) ≤ F(x₀) }.
Performance on Strongly Convex Functions

Theorem 2. Let F be strongly convex with respect to ‖·‖_L with convexity parameter μ > 0, and choose accuracy level ε > 0, confidence level 0 < ρ < 1, and

    k ≥ (n / (1 − γ_μ)) log((F(x₀) − F*) / (ρε)),

where γ_μ is given by

    γ_μ = 1 − μ/4,  if μ ≤ 2;   1/μ,  otherwise.   (3)

If x_k is the random point generated by UCDC(x₀), then P(F(x_k) − F* ≤ ε) ≥ 1 − ρ.
Comparison with other coordinate descent methods for composite minimization

method  | Lip. const. | Ψ         | block | coordinate choice | iterations for E[F(x_k)] − F* ≤ ε | 1 iteration
[YT09]  | L(∇f)       | separable | Yes   | greedy            | O(n L(∇f) ‖x* − x₀‖²₂ / ε)        | O(n²)
[SST09] | L_i = β     | ‖·‖₁      | No    | random, 1/n       | O(n β ‖x* − x₀‖²₂ / ε)            | cheap
[ST10]  | L(∇f)       | ‖·‖₁      | No    | cyclic            | O(n L(∇f) ‖x* − x₀‖²₂ / ε)        | cheap
Alg 1   | L_i         | separable | Yes   | random, 1/n       | O(n ‖x* − x₀‖²_L / ε)             | cheap

[YT09] S. Yun and P. Tseng. A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140:513-535, 2009.
[SST09] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. In: ICML 2009.
[ST10] A. Saha and A. Tewari. On the finite time convergence of cyclic coordinate descent methods. arXiv:1005.2146, 2010.
Comparison with Nesterov's randomized coordinate descent

... for achieving P(F(x_k) − F* ≤ ε) ≥ 1 − ρ in the general case (F not strongly convex):

Alg   | Ψ         | p_i     | norms   | complexity                                              | objective
[N10] | 0         | 1/n     | Euclid. | (8n R²_L(x₀) / ε) log(4(f(x₀) − f*) / (ερ))             | f(x) + (ε / (8R²_L(x₀))) ‖x − x₀‖²_L
[N10] | 0         | L_i/S   | Euclid. | (2n + 8S R²_I(x₀) / ε) log(4(f(x₀) − f*) / (ερ))        | f(x) + (ε / (8R²_I(x₀))) ‖x − x₀‖²_I
Alg1  | 0         | > 0     | general | (2 R²_{LP⁻¹}(x₀) / ε)(1 + log(1/ρ))                     | f
Alg2  | separable | 1/n     | Euclid. | (2nM / ε)(1 + log(1/ρ)) and (2n R²_L(x₀) / ε) log((F(x₀) − F*) / (ερ)) | F

Symbols used in the table: S = Σ_i L_i, M = max{R²_L(x₀), F(x₀) − F*}.

[N10] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Paper #2010/2, 2010.
Application: Sparse Regression

Problem formulation & Lipschitz constants:

    min_{x ∈ R^N} ½ ‖Ax − b‖²₂ + γ ‖x‖₁,   L_i = a_iᵀ a_i (easy to compute).

Note: the Lipschitz constant of ∇f is λ_max(AᵀA) (hard to compute).

UCDC for Sparse Regression
Choose x₀ = 0 and set g₀ = Ax₀ − b = −b
  for k = 0, 1, 2, ... do
    choose i_k = i ∈ {1, 2, ..., N} uniformly
    ∇_i f(x_k) = a_iᵀ g_k
    T^(i)(x_k) =
      −(∇_i f(x_k) + γ)/L_i,  if x_k^(i) − (∇_i f(x_k) + γ)/L_i > 0,
      −(∇_i f(x_k) − γ)/L_i,  if x_k^(i) − (∇_i f(x_k) − γ)/L_i < 0,
      −x_k^(i),               otherwise,
    x_{k+1} = x_k + e_i T^(i)(x_k)
    g_{k+1} = g_k + a_i T^(i)(x_k)
  end for
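The update above can be sketched as a minimal serial C++ routine (our names and data layout, not the talk's implementation): A is stored by columns a_i, and the residual g = Ax − b is maintained incrementally, so one iteration costs O(nnz(a_i)).

```cpp
#include <cmath>
#include <random>
#include <vector>

// Serial UCDC sketch for min 0.5*||Ax - b||^2 + gamma*||x||_1,
// with the closed-form soft-thresholding step T^(i)(x_k) from the slide.
std::vector<double> ucdc_lasso(const std::vector<std::vector<double>>& cols,
                               const std::vector<double>& b,
                               double gamma, int iters, unsigned seed = 1) {
    const int N = static_cast<int>(cols.size());
    const int m = static_cast<int>(b.size());
    std::vector<double> x(N, 0.0);
    std::vector<double> g(m);                   // g = Ax - b; here x0 = 0
    for (int r = 0; r < m; ++r) g[r] = -b[r];
    std::vector<double> L(N, 0.0);              // L_i = a_i' a_i
    for (int i = 0; i < N; ++i)
        for (int r = 0; r < m; ++r) L[i] += cols[i][r] * cols[i][r];
    std::mt19937 gen(seed);
    for (int k = 0; k < iters; ++k) {
        int i = static_cast<int>(gen() % N);    // uniform coordinate choice
        double grad = 0.0;                      // grad_i f(x_k) = a_i' g_k
        for (int r = 0; r < m; ++r) grad += cols[i][r] * g[r];
        double t;                               // closed-form T^(i)(x_k)
        if (x[i] - (grad + gamma) / L[i] > 0)      t = -(grad + gamma) / L[i];
        else if (x[i] - (grad - gamma) / L[i] < 0) t = -(grad - gamma) / L[i];
        else                                       t = -x[i];
        x[i] += t;                              // x_{k+1} = x_k + e_i t
        for (int r = 0; r < m; ++r) g[r] += cols[i][r] * t;  // g_{k+1}
    }
    return x;
}
```

For A = I the method reproduces the soft-thresholding solution x* = sign(b)·max(|b| − γ, 0) after a single visit to each coordinate, which makes the residual-update bookkeeping easy to check.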
Instance 1 run on CPU: A ∈ R^{10^7 × 10^6}, ‖A‖₀ = 10^8

k/n     | F(x_k) − F* | ‖x_k‖₀   | time [s]
0.0010  | < 10^16     | 857      | 0.01
12.72   | < 10^11     | 999,246  | 54.48
15.23   | < 10^10     | 997,944  | 65.19
18.61   | < 10^9      | 990,923  | 79.69
20.61   | < 10^8      | 978,761  | 88.25
23.38   | < 10^7      | 926,167  | 100.08
25.91   | < 10^6      | 763,314  | 110.94
28.22   | < 10^5      | 366,835  | 120.84
30.66   | < 10^4      | 57,991   | 131.25
32.83   | < 10^3      | 9,701    | 140.55
35.05   | < 10^2      | 2,538    | 150.02
37.01   | < 10^1      | 1,722    | 158.39
38.26   | < 10^0      | 1,633    | 163.75
40.98   | < 10^−1     | 1,604    | 175.38
42.71   | < 10^−3     | 1,600    | 182.75
42.71   | < 10^−4     | 1,600    | 182.77
43.28   | < 10^−5     | 1,600    | 185.21
44.86   | < 10^−6     | 1,600    | 191.94
Instance 2 run on CPU: A ∈ R^{10^9 × 10^8}, ‖A‖₀ = 2·10^9

k/n    | F(x_k) − F* | ‖x_k‖₀     | time [s]
0.01   | < 10^18     | 18,486     | 1.32
9.35   | < 10^14     | 99,837,255 | 1294.72
11.97  | < 10^13     | 99,567,891 | 1657.32
14.78  | < 10^12     | 98,630,735 | 2045.53
17.12  | < 10^11     | 96,305,090 | 2370.07
20.09  | < 10^10     | 86,242,708 | 2781.11
22.60  | < 10^9      | 58,157,883 | 3128.49
24.97  | < 10^8      | 19,926,459 | 3455.80
28.62  | < 10^7      | 747,104    | 3960.96
31.47  | < 10^6      | 266,180    | 4325.60
34.47  | < 10^5      | 175,981    | 4693.44
36.84  | < 10^4      | 163,297    | 5004.24
39.39  | < 10^3      | 160,516    | 5347.71
41.08  | < 10^2      | 160,138    | 5577.22
43.88  | < 10^1      | 160,011    | 5941.72
45.94  | < 10^0      | 160,002    | 6218.82
46.19  | < 10^−1     | 160,001    | 6252.20
46.25  | < 10^−2     | 160,000    | 6260.20
46.89  | < 10^−3     | 160,000    | 6344.31
46.91  | < 10^−4     | 160,000    | 6346.99
46.93  | < 10^−5     | 160,000    | 6349.69
Acceleration on a Graphics Processing Unit (GPU)

        | CPU      | GPU
cores   | 1 core   | 448 cores
clock   | 3.00 GHz | 1.15 GHz
peak    | 3 GFlops | 1 TFlop

GPU (Graphics Processing Unit): performs a single operation in parallel on multiple data (e.g., vector addition).
GPU Architecture: Tesla C2050
- 14 multiprocessors
- each multiprocessor has 32 cores (each at 1.15 GHz)
- every core in one multiprocessor must perform the same operation on different data ⇒ SIMD (Single Instruction, Multiple Data)
Huge and Sparse Data = Small Chance of Collision
Race Conditions and Atomic Operations
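The talk resolves concurrent updates on the GPU with atomic operations. The same idea in portable C++ (an illustrative sketch, not the CUDA kernel): many threads add contributions to a shared counter, and with plain writes this would be a data race, while fetch-add makes each update indivisible.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each of `nthreads` threads increments a shared counter `per_thread`
// times. std::atomic::fetch_add guarantees no increments are lost,
// mirroring CUDA's atomicAdd on shared GPU memory.
long long parallel_count(int nthreads, int per_thread) {
    std::atomic<long long> total{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&total, per_thread] {
            for (int j = 0; j < per_thread; ++j)
                total.fetch_add(1, std::memory_order_relaxed);  // atomic update
        });
    for (auto& th : pool) th.join();
    return total.load();
}
```

With a plain `long long total` and `++total` in the lambda, some increments would overwrite each other; the atomic version always returns exactly nthreads × per_thread (compile with -pthread).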
CPU (C) vs GPU (CUDA): step-by-step comparison, n = 10^8
Acceleration by GPU
Naive Implementation
- each thread chooses its coordinate independently,
- matrix A is naturally stored in the CSC sparse format.

As a consequence, data access:
- is NOT coalesced (it is not sequential at all),
- is misaligned (decreasing performance).
Misaligned copy
Better Implementation

Memory access improvement:
- all threads in ONE WARP generate the same random number, which points to the beginning of a memory bank (from where the data will be read),
- each thread then computes its own coordinate i (one warp works on ALIGNED and SEQUENTIAL coordinates).
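The indexing scheme above can be illustrated on the CPU side (a sketch of the index computation only, not the CUDA kernel; the warp size 32 matches the Tesla C2050): one random draw is shared by the whole warp, and lane t takes coordinate start + t, so the warp touches 32 aligned, consecutive coordinates.

```cpp
#include <random>
#include <vector>

// One shared random draw selects an aligned block of 32 coordinates;
// lane t of the warp then works on coordinate block*32 + t, so the
// warp's memory accesses are sequential and aligned.
std::vector<int> warp_coordinates(int n_blocks, unsigned seed = 1) {
    const int WARP = 32;
    std::mt19937 gen(seed);
    int block = static_cast<int>(gen() % n_blocks);  // shared by the warp
    std::vector<int> coords(WARP);
    for (int lane = 0; lane < WARP; ++lane)
        coords[lane] = block * WARP + lane;          // aligned, sequential
    return coords;
}
```

Contrast with the naive scheme, where 32 independent draws scatter the warp's reads across unrelated memory banks.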
Pipelining

C++ code:

    int offset = ...;
    for (int i = offset; i < N; i++) {
        A[i] = A[i + offset] * s;
    }

Left: offset is a constant. Right: offset is a variable, so the compiler cannot optimize (pipeline/vectorize) as well as before.
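The constant-vs-variable offset contrast can be made explicit in code (our illustration; the buffer is sized so the read A[i + offset] stays in bounds). With a compile-time offset the dependence distance of the loop is known and the compiler can pipeline and vectorize aggressively; with a runtime offset it must assume a possible loop-carried dependence.

```cpp
#include <vector>

// Same loop as on the slide, in two variants. The template parameter
// makes the offset a compile-time constant, so the compiler knows the
// exact dependence distance between the store A[i] and the load
// A[i + Offset].
template <int Offset>
void scale_shifted_const(std::vector<double>& A, double s) {
    const int N = static_cast<int>(A.size()) - Offset;
    for (int i = Offset; i < N; ++i)
        A[i] = A[i + Offset] * s;
}

// Runtime offset: identical result, but the compiler must be
// conservative when optimizing.
void scale_shifted_var(std::vector<double>& A, int offset, double s) {
    const int N = static_cast<int>(A.size()) - offset;
    for (int i = offset; i < N; ++i)
        A[i] = A[i + offset] * s;
}
```

Both variants compute the same values; only the generated code (and hence the throughput the talk measured) differs.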
Facing the Multicore-Challenge II (a few insights from a conference)
SIMD and Auto-Vectorization
- CPU: 2.2 GHz × 4 flops/cycle × 12 cores = 105.6 GFlops (AMD Magny-Cours)
- GPU: 1030.4 GFlops (Tesla C2050/C2070 GPU Computing Processor)

"...the memory subsystem (or at least the cache) must be able to sustain sufficient bandwidth to keep all units busy..."
Intel Compiler 12+
- -guide=n : analyzes the code and gives some recommendations
- -vec-report<n> : reports on loop vectorization, showing which loops were vectorized and which were not
- #pragma ivdep : vectorize, ignoring assumed data dependences
- -ax : auto-vectorization

C++ code (A could alias C, so the store to A[j] might modify the loop bound C[7], which blocks vectorization):

    int C[N];
    ...
    for (int j = 0; j < C[7]; j++) {  // C[7] = 2034
        A[j] = 2 * B[j];
    }

With the dependence assertion added, the compiler vectorizes the loop:

    int C[N];
    ...
    #pragma ivdep
    for (int j = 0; j < C[7]; j++) {  // C[7] = 2034
        A[j] = 2 * B[j];
    }
StarSs - Introduction
StarSs - Renaming
StarSs - GPU
Intel Cilk Plus
- Write parallel programs using a simple model: with only three keywords to learn, C and C++ developers move quickly into the parallel programming domain.
- Utilize data parallelism through simple array notations, including elemental function capabilities.
- Leverage existing serial tools: the serial semantics of Intel Cilk Plus allows you to debug in a familiar serial debugger.
- Scale for the future: the runtime system operates smoothly on systems with hundreds of cores.
Cilk - Arrays

    __declspec(vector)
    float saxpy_elemental(float a, float x, float y)
    {
        return (a * x + y);
    }

    // Here is how to invoke such vectorized elemental
    // functions using array notation:
    z[:] = saxpy_elemental(a, b[:], c[:]);
    a[:] = sin(b[:]);    // vector math library function
    a[:] = pow(b[:], c); // b[:]^c via a vector math library function
Cilk - Fibonacci - Create New Thread

    cilk int fib(int n)
    {
        if (n < 2) return n;
        else
        {
            int x, y;

            x = spawn fib(n - 1);
            y = spawn fib(n - 2);

            sync;

            return (x + y);
        }
    }
Any Questions? Thank you for your attention
References
- Georg Hager and Gerhard Wellein. Introduction to High Performance Computing for Scientists and Engineers.
- Jesus Labarta, Rosa Badia, Eduard Ayguade, Marta Garcia, Vladimir Marjanovic. Hybrid Parallel Programming with MPI+SMPSs.