Asynchronous Parallel Computing in Signal Processing and Machine Learning
1 Asynchronous Parallel Computing in Signal Processing and Machine Learning. Wotao Yin (UCLA Math), joint with Zhimin Peng (UCLA), Yangyang Xu (IMA), Ming Yan (MSU). Optimization and Parsimonious Modeling, IMA, Jan 25. 1 / 49
2 Do we need parallel computing? 2 / 49
3 Back in / 49
6 35 Years of CPU Trend. [figure: number of CPUs, performance per core, and cores per CPU over time] D. Henty, Emerging Architectures and Programming Models for Parallel Computing. In May 2004, Intel cancelled its Tejas project (single-core) and announced a new multi-core project. 6 / 49
7 Today: 4x AMD 16-core 3.5GHz CPUs (64 cores total) 7 / 49
8 Today: Tesla K80 GPU (2496 cores) 8 / 49
9 Today: Octa-Core Handsets 9 / 49
10 Free lunch was over before 2005: no single-threaded algorithm automatically gets faster anymore. Now, new algorithms must be developed for faster speeds by exploiting problem structures, taking advantage of dataset properties, and using all the cores available. 10 / 49
11 How to use all the cores available? 11 / 49
12 Parallel computing. [diagram: one problem split among agents working over times $t_1, t_2, \dots, t_N$] 12 / 49
13 Parallel speedup. Definition (time is in the wall-clock sense): speedup = serial time / parallel time. Amdahl's Law: with N agents, no overhead, and ρ = percentage of the computation that parallelizes, ideal speedup $= \frac{1}{\rho/N + (1-\rho)}$. [figure: speedup vs. number of processors for several values of ρ (50%, 90%, 95%)] 13 / 49
14 Parallel speedup. Let ε := parallel overhead (startup, synchronization, collection) in the real world. Then actual speedup $= \frac{1}{\rho/N + (1-\rho) + \varepsilon}$. [figures: speedup vs. number of processors for ρ = 50%, 90%, 95%, when ε grows like N and when ε grows like log(N)] 14 / 49
15 Sync-parallel versus async-parallel. [diagram: agents 1-3 with idle gaps under the synchronous schedule; no gaps under the asynchronous one] Synchronous (wait for the slowest); Asynchronous (non-stop, no wait). 15 / 49
16 Async-parallel coordinate updates 16 / 49
17 Fixed point iteration and its parallel version. Let $H = H_1 \times \cdots \times H_m$. Original iteration: $x^{k+1} = Tx^k =: (I - \eta S)x^k$. All agents run in parallel: agent $i$ updates $x_i^{k+1} \leftarrow T_i(x^k) = x_i^k - \eta S_i(x^k)$, for $i = 1, \dots, m$. Assumptions: 1. coordinate friendliness: cost of $S_i x \approx \frac{1}{m} \times$ cost of $Sx$; 2. synchronization after each iteration. 17 / 49
18 Comparison. [timeline: agents 1-3 over $t_0, t_1, \dots, t_{10}$] Synchronous: new iteration = all agents finish. Asynchronous: new iteration = any agent finishes. 18 / 49
19 ARock¹: Async-parallel coordinate update. $H = H_1 \times \cdots \times H_m$; p agents, possibly $p \ll m$. Each agent randomly picks $i \in \{1, \dots, m\}$ and updates just $x_i$: $x_j^{k+1} = x_j^k$ for $j \neq i$, and $x_i^{k+1} = x_i^k - \eta_k S_i x^{k - d_k}$, where $0 \le d_k \le \tau$, the maximum delay. ¹ Peng-Xu-Yan-Yin. 19 / 49
20 Two ways to model $x^{k-d_k}$. Definitions: let $x^0, \dots, x^k, \dots$ be the states of x in the memory. 1. $x^{k-d_k}$ is consistent if $d_k$ is a scalar. 2. $x^{k-d_k}$ is possibly inconsistent if $d_k$ is a vector and different components are delayed by different amounts. ARock allows both consistent and inconsistent reads. 20 / 49
21 Memory lock illustration. Agent 1 reads $[0, 0, 0, 0]^T = x^0$: consistent read. Agent 1 reads $[0, 0, 0, 2]^T \notin \{x^0, x^1, x^2\}$: inconsistent read. 21 / 49
22 History and recent literature 22 / 49
23 Brief history of async-parallel algorithms (mostly worst-case analysis). 1969: a linear equation solver by Chazan and Miranker; 1978: extended to the fixed-point problem by Baudet under the absolute-contraction² type of assumption. For years, mainly used to solve linear, nonlinear, and differential equations by many people. 1989: Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis. Review by Frommer and Szyld. Gradient-projection iteration assuming a local linear-error bound by Tseng. 2001: domain decomposition assuming strong convexity by Tai & Tseng. ² An operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is absolute-contractive if $|T(x) - T(y)| \le P|x - y|$ component-wise, where $|x|$ denotes the vector with components $|x_i|$, $i = 1, \dots, n$, and $P \in \mathbb{R}^{n \times n}_+$ with $\rho(P) < 1$. 23 / 49
24 Absolute-contraction. An operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is absolute-contractive if $|T(x) - T(y)| \le P|x - y|$ component-wise, where $|x|$ denotes the vector with components $|x_i|$, $i = 1, \dots, n$, and $P \in \mathbb{R}^{n \times n}_+$ with $\rho(P) < 1$. Interpretation: a series of nested rectangular boxes for $x^{k+1} = Tx^k$. Applications: diagonally dominant A for Ax = b; diagonally dominant $\nabla^2 f$ for $\min_x f(x)$ (strong convexity alone is not enough); some network flow problems. 24 / 49
25 Recent work (stochastic analysis). AsySCD for convex smooth and composite minimization by Liu et al. 14 and Liu-Wright 14. Async dual CD (regression problems) by Hsieh et al. 15. Async randomized (splitting/distributed/incremental) methods: Wei-Ozdaglar 13, Iutzeler et al. 13, Zhang-Kwok 14, Hong 14, Chang et al. 15. Async SGD: Hogwild!, Lian 15, etc. Async operator sample and CD: SMART by Davis. 25 / 49
26 Random coordinate selection. Select $x_i$ to update with probability $p_i$, where $\min_i p_i > 0$. Drawbacks: agents cannot cache data, so either global memory or communication is required; pseudo-random number generation takes time. Benefits: often faster than a fixed cyclic order; automatic load balance; simplifies certain analysis. 26 / 49
27 Convergence summary 27 / 49
28 Convergence guarantees. m is the # of coordinates, τ is the maximum delay, uniform selection $p_i \equiv \frac{1}{m}$. Theorem (almost sure convergence): Assume that T is nonexpansive and has a fixed point. Use step sizes $\eta_k \in \big[\epsilon, \frac{1}{2m^{-1/2}\tau + 1}\big)$, $\forall k$. Then, with probability one, $x^k \rightharpoonup x^* \in \mathrm{Fix}\,T$. In addition, rates can be derived. Consequence: the step size is O(1) if $\tau \sim \sqrt{m}$. Under equal agents and updates, linear speedup is attained when using $p = O(\sqrt{m})$ agents; p can be bigger if T is sparse. 28 / 49
29 Sketch of proof. Typical inequality: $\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2 - c\,\|Tx^k - x^k\|^2 + \text{harmful terms}(x^{k-1}, \dots, x^{k-\tau})$. Descent inequality under a new metric: $\mathbb{E}\big(\|\mathbf{x}^{k+1} - \mathbf{x}^*\|_M^2 \,\big|\, \mathcal{X}^k\big) \le \|\mathbf{x}^k - \mathbf{x}^*\|_M^2 - c\,\|Tx^k - x^k\|^2$, where the history up to iteration k is $\mathbf{x}^k = (x^k, x^{k-1}, \dots, x^{k-\tau}) \in H^{\tau+1}$, $k \ge 0$; any $\mathbf{x}^* = (x^*, x^*, \dots, x^*) \in X^* \subseteq H^{\tau+1}$; M is a positive definite matrix; $c = c(\eta_k, m, \tau)$. 29 / 49
30 Apply the Robbins-Siegmund theorem: if $\mathbb{E}(\alpha_{k+1} \mid \mathcal{F}_k) + v_k \le (1 + \xi_k)\alpha_k + \eta_k$, where all quantities are nonnegative, $\alpha_k$ is random, and $\xi_k, \eta_k$ are summable, then $\alpha_k$ converges a.s. Prove that weak cluster points are fixed points; assume H is separable and apply results of [Combettes, Pesquet 2014]. 30 / 49
31 Applications and numerical results 31 / 49
32 Linear equations (asynchronous Jacobi). Require: an invertible square matrix A with nonzero diagonal entries. Let D be the diagonal part of A; then $Ax = b \iff \underbrace{(I - D^{-1}A)x + D^{-1}b}_{Tx} = x$. T is nonexpansive if $\|I - D^{-1}A\|_2 \le 1$, i.e., A is diagonally dominant. $x^{k+1} = Tx^k$ recovers the Jacobi algorithm. 32 / 49
33 Algorithm 1: ARock for linear equations.
Input: shared variables $x \in \mathbb{R}^n$, K > 0; set global iteration counter k = 0.
While k < K, every agent asynchronously and continuously does:
  sample $i \in \{1, \dots, m\}$ uniformly at random;
  add $-\frac{\eta_k}{a_{ii}}\big(\sum_j a_{ij}\hat{x}^k_j - b_i\big)$ to shared variable $x_i$;
  update the global counter $k \leftarrow k + 1$.
33 / 49
34 Sample code

  loaddata(A, data_file_name);
  loaddata(b, label_file_name);
  #pragma omp parallel num_threads(p) shared(A, b, x, para)
  {
    // A, b, x, and para are passed by reference
    call Jacobi(A, b, x, para) or ARock(A, b, x, para);
  }

p: the number of threads; A, b, x: shared variables; para: other parameters. 34 / 49
35 Jacobi worker function

  for (int itr = 0; itr < max_itr; itr++) {
    // compute the update for the assigned x[i]
    // ...
    #pragma omp barrier
    {
      // write x[i] in global memory
    }
    #pragma omp barrier
  }

Jacobi needs the barrier directive for synchronization. 35 / 49
36 ARock worker function

  for (int itr = 0; itr < max_itr; itr++) {
    // pick i at random
    // compute the update for x[i]
    // ...
    // write x[i] in global memory
  }

ARock has no synchronization barrier directive. 36 / 49
37 Minimizing smooth functions. Require: a convex and Lipschitz differentiable function f. If $\nabla f$ is L-Lipschitz, then minimizing f(x) over x is equivalent to $x = \underbrace{\big(I - \tfrac{2}{L}\nabla f\big)}_{T}\,x$, where T is nonexpansive. ARock will be efficient when $\nabla_i f(x)$ is easy to compute. 37 / 49
38 Minimizing composite functions. Require: convex smooth g(·) and convex (possibly nonsmooth) f(·). Proximal map: $\mathrm{prox}_{\gamma f}(y) = \arg\min_x f(x) + \frac{1}{2\gamma}\|x - y\|^2$. Minimizing f(x) + g(x) over x is equivalent to $x = \underbrace{\mathrm{prox}_{\gamma f}(I - \gamma \nabla g)}_{T}\,x$. ARock will be fast if $\nabla_i g(x)$ is easy to compute and f(·) is separable (e.g., $\ell_1$ and $\ell_{1,2}$). 38 / 49
39 Example: sparse logistic regression. n features, N labeled samples; each sample $a_i \in \mathbb{R}^n$ has its label $b_i \in \{1, -1\}$. $\ell_1$-regularized logistic regression:
$\underset{x \in \mathbb{R}^n}{\text{minimize}}\ \lambda \|x\|_1 + \frac{1}{N}\sum_{i=1}^N \log\big(1 + \exp(-b_i a_i^T x)\big), \quad (1)$
Compare sync-parallel and ARock (async-parallel) on two datasets:
Name   | N (# samples) | n (# features) | # nonzeros in {a_1, ..., a_N}
rcv1   | 20,242        | 47,236         | 1,498,952
news20 | 19,996        | 1,355,191      | 9,097,
39 / 49
40 Speedup tests. Implemented in C++ and OpenMP; 32-core shared-memory machine. [table: wall-clock time (s) and speedup of async (ARock) vs. sync for 1-32 cores on rcv1 and news20; the numeric entries did not survive transcription] 40 / 49
41 More applications 41 / 49
42 Minimizing composite functions. Require: both f and g are convex (possibly nonsmooth) functions. Minimizing f(x) + g(x) is equivalent to $z = \underbrace{\mathrm{refl}_{\gamma f} \circ \mathrm{refl}_{\gamma g}}_{T_{PRS}}(z)$; recover $x = \mathrm{prox}_{\gamma g}(z)$. $T_{PRS}$ is known as the Peaceman-Rachford splitting operator³; the Douglas-Rachford splitting operator is $\frac{1}{2}I + \frac{1}{2}T_{PRS}$. ARock runs fast when $\mathrm{refl}_{\gamma f}$ is separable and $(\mathrm{refl}_{\gamma g})_i$ is easy to compute/maintain. ³ Reflective proximal map: $\mathrm{refl}_{\gamma f} := 2\,\mathrm{prox}_{\gamma f} - I$. The maps $\mathrm{refl}_{\gamma f}$, $\mathrm{refl}_{\gamma g}$, and thus $\mathrm{refl}_{\gamma f} \circ \mathrm{refl}_{\gamma g}$ are nonexpansive. 42 / 49
43 Parallel/distributed ADMM. Require: m convex functions $f_i$. Consensus problem: $\underset{x}{\text{minimize}}\ \sum_{i=1}^m f_i(x) + g(x)$, reformulated as $\underset{x_i, y}{\text{minimize}}\ \sum_{i=1}^m f_i(x_i) + g(y)$ subject to $x_i - y = 0$, $i = 1, \dots, m$ (stacked as a block matrix of identity blocks acting on $(x_1, \dots, x_m, y)$). Apply Douglas-Rachford-ARock to the dual problem to obtain async-parallel ADMM: the m subproblems are solved in the async-parallel fashion; y and the $z_i$ (dual variables) are updated in global memory (no lock). 43 / 49
44 Decentralized computing. n agents in a connected network G = (V, E) with bi-directional links E; each agent i has a private function $f_i$. Problem: find a consensus solution x: $\underset{x \in \mathbb{R}^p}{\text{minimize}}\ f(x) := \sum_{i=1}^n f_i(x_i)$ subject to $x_i = x_j,\ \forall i, j$. Challenges: no center, only between-neighbor communication. Benefits: fault tolerance, no long-distance communication, privacy. 44 / 49
45 Async-parallel decentralized ADMM. A graph of connected agents: G = (V, E). Decentralized consensus optimization problem: $\underset{x_i \in \mathbb{R}^d, i \in V}{\text{minimize}}\ f(x) := \sum_{i \in V} f_i(x_i)$ subject to $x_i = x_j,\ \forall (i, j) \in E$. ADMM reformulation: constraints $x_i = y_{ij}$, $x_j = y_{ij}$, $\forall (i, j) \in E$. Apply ARock: version 1, nodes asynchronously activate; version 2, edges asynchronously activate. No global clock, no central controller; each agent keeps $f_i$ private and talks only to its neighbors. 45 / 49
46 Notation: E(i), all edges of agent i, with $E(i) = L(i) \cup R(i)$; L(i), neighbors j of agent i with j < i; R(i), neighbors j of agent i with j > i.
Algorithm 2: ARock for the decentralized consensus problem.
Input: each agent i sets $x_i^0 \in \mathbb{R}^d$, dual variables $z_{e,i}^0$ for $e \in E(i)$, K > 0.
While k < K, any activated agent i does:
  receive $\hat{z}^k_{li,l}$ from neighbors $l \in L(i)$ and $\hat{z}^k_{ir,r}$ from neighbors $r \in R(i)$;
  update local $\hat{x}_i^k$, $z^{k+1}_{li,i}$, and $z^{k+1}_{ir,i}$ according to (2a)-(2c), respectively;
  send $z^{k+1}_{li,i}$ to neighbors $l \in L(i)$ and $z^{k+1}_{ir,i}$ to neighbors $r \in R(i)$.
$\hat{x}_i^k \leftarrow \arg\min_{x_i} f_i(x_i) + \big(\textstyle\sum_{l \in L(i)} \hat{z}^k_{li,l} + \sum_{r \in R(i)} \hat{z}^k_{ir,r}\big)^T x_i + \frac{\gamma}{2}|E(i)|\,\|x_i\|^2, \quad \forall i \quad (2a)$
$z^{k+1}_{ir,i} = z^k_{ir,i} - \eta_k\big((\hat{z}^k_{ir,i} + \hat{z}^k_{ir,r})/2 + \gamma \hat{x}^k_i\big), \quad r \in R(i), \quad (2b)$
$z^{k+1}_{li,i} = z^k_{li,i} - \eta_k\big((\hat{z}^k_{li,i} + \hat{z}^k_{li,l})/2 + \gamma \hat{x}^k_i\big), \quad l \in L(i). \quad (2c)$
46 / 49
47 Summary 47 / 49
48 Summary of async-parallel coordinate descent. Benefits: eliminate idle time; reduce communication/memory-access congestion; random job selection gives load balance. Mathematics: analysis of disorderly (partial) updates, asynchronous delay, inconsistent reads and writes. 48 / 49
49 Thank you! Acknowledgements: NSF DMS Reference: Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin. UCLA CAM Website: wotaoyin/arock 49 / 49
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationStochastic Quasi-Newton Methods
Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationDistributed Optimization and Statistics via Alternating Direction Method of Multipliers
Distributed Optimization and Statistics via Alternating Direction Method of Multipliers Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato Stanford University Stanford Statistics Seminar, September 2010
More informationWE consider an undirected, connected network of n
On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been
More informationPreconditioning via Diagonal Scaling
Preconditioning via Diagonal Scaling Reza Takapoui Hamid Javadi June 4, 2014 1 Introduction Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time.
More informationImportance Sampling for Minibatches
Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,
More informationDLM: Decentralized Linearized Alternating Direction Method of Multipliers
1 DLM: Decentralized Linearized Alternating Direction Method of Multipliers Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro Abstract This paper develops the Decentralized Linearized Alternating Direction
More informationDistributed Optimization via Alternating Direction Method of Multipliers
Distributed Optimization via Alternating Direction Method of Multipliers Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato Stanford University ITMANET, Stanford, January 2011 Outline precursors dual decomposition
More informationCYCLIC COORDINATE-UPDATE ALGORITHMS FOR FIXED-POINT PROBLEMS: ANALYSIS AND APPLICATIONS
SIAM J. SCI. COMPUT. Vol. 39, No. 4, pp. A80 A300 CYCLIC COORDINATE-UPDATE ALGORITHMS FOR FIXED-POINT PROBLEMS: ANALYSIS AND APPLICATIONS YAT TIN CHOW, TIANYU WU, AND WOTAO YIN Abstract. Many problems
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationAsynchronous Parallel Algorithms for Nonconvex Big-Data Optimization Part I: Model and Convergence
Noname manuscript No. (will be inserted by the editor) Asynchronous Parallel Algorithms for Nonconvex Big-Data Optimization Part I: Model and Convergence Loris Cannelli Francisco Facchinei Vyacheslav Kungurtsev
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationSEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS
SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS XIANTAO XIAO, YONGFENG LI, ZAIWEN WEN, AND LIWEI ZHANG Abstract. The goal of this paper is to study approaches to bridge the gap between
More informationFirst-order methods for structured nonsmooth optimization
First-order methods for structured nonsmooth optimization Sangwoon Yun Department of Mathematics Education Sungkyunkwan University Oct 19, 2016 Center for Mathematical Analysis & Computation, Yonsei University
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationEE364b Convex Optimization II May 30 June 2, Final exam
EE364b Convex Optimization II May 30 June 2, 2014 Prof. S. Boyd Final exam By now, you know how it works, so we won t repeat it here. (If not, see the instructions for the EE364a final exam.) Since you
More informationAlternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization
Alternating Direction Method of Multipliers Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last time: dual ascent min x f(x) subject to Ax = b where f is strictly convex and closed. Denote
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationA Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS
ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in
More informationAccelerating Nesterov s Method for Strongly Convex Functions
Accelerating Nesterov s Method for Strongly Convex Functions Hao Chen Xiangrui Meng MATH301, 2011 Outline The Gap 1 The Gap 2 3 Outline The Gap 1 The Gap 2 3 Our talk begins with a tiny gap For any x 0
More information