Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Size: px

Start display at page:

Download "Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence"

Joella Henry
5 years ago
Views:

1 Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li ( PDSL CS, NJU 1 / 36

2 Outline Introduction 1 Introduction 2 AsySVRG 3 SCOPE 4 Conclusion Wu-Jun Li ( PDSL CS, NJU 2 / 36

3 Machine Learning Introduction Supervised Learning: Given a set of training examples {(x i, y i )} n i=1, supervised learning tries to solve the following regularized empirical risk minimization problem: min f(w) = 1 n f i (w), w n where f i (w) is the loss function (plus some regularization term) defined on example i, and w is the parameter to learn. Examples: Logistic regression: f(w) = 1 n n i=1 [log(1 + e yixt i w ) + λ 2 w 2 ] SVM: f(w) = 1 n n i=1 [max{0, 1 y ix T i w} + λ 2 w 2 ] Deep learning models Unsupervised Learning: Many unsupervised learning models, such as PCA and matrix factorization, can also be reformulated as similar problems. Wu-Jun Li ( PDSL CS, NJU 3 / 36 i=1

4 Introduction Machine Learning for Big Data For big data applications, first-order methods have become much more popular than other higher-order methods for learning (optimization). Gradient descent methods are the most representative first-order methods. (Deterministic) gradient descent (GD): w t+1 w t η t [ 1 n f i (w t )], n where t is the iteration number. Linear convergence rate: O(ρ t ) Iteration cost is O(n) Stochastic gradient descent (SGD): In the t th iteration, randomly choosing an example i t {1, 2,..., n}, then update i=1 w t+1 w t η t f it (w t ) Iteration cost is O(1) The convergence rate is sublinear: O(1/t) Wu-Jun Li ( PDSL CS, NJU 4 / 36

5 Introduction Stochastic Learning for Big Data Researchers recently proposed improved versions of SGD: SAG [Roux et al., NIPS 2012], SDCA [Shalev-Shwartz and Zhang, JMLR 2013], SVRG [Johnson and Zhang, NIPS 2013] Number of gradient ( f i ) evaluation to reach ɛ for smooth and strongly convex problems: GD: O(nκ log( 1 ɛ )) SGD: O(κ( 1 ɛ )) SAG: O(n log( 1 ɛ )) when n 8κ SDCA: O((n + κ) log( 1 ɛ )) SVRG: O((n + κ) log( 1 ɛ )) where κ = L µ > 1 is the condition number of the objective function. Stochastic Learning: Stochastic GD Stochastic coordinate descent Stochastic dual coordinate ascent Wu-Jun Li ( PDSL CS, NJU 5 / 36

6 Introduction Parallel and Distributed Stochastic Learning To further improve the learning scalability (speed): Parallel stochastic learning: One machine with multiple cores and a shared memory Distributed stochastic learning: A cluster with multiple machines Key issues: cooperation Parallel stochastic learning: lock vs. lock-free: waiting cost and lock cost Distributed stochastic learning: synchronous vs. asynchronous: waiting cost and communication cost Wu-Jun Li ( PDSL CS, NJU 6 / 36

7 Our Contributions Introduction Parallel stochastic learning: AsySVRG Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Distributed stochastic learning: SCOPE Scalable Composite Optimization for Learning Wu-Jun Li ( PDSL CS, NJU 7 / 36

8 Outline AsySVRG 1 Introduction 2 AsySVRG 3 SCOPE 4 Conclusion Wu-Jun Li ( PDSL CS, NJU 8 / 36

9 AsySVRG Motivation and Contribution Motivation: Existing asynchronous parallel SGD: Hogwild! [Recht et al. 2011], and PASSCoDe [Hsieh, Yu, and Dhillon 2015] No parallel methods for SVRG. Lock-free: empirically effective, but no theoretical proof. Contribution: A fast asynchronous method to parallelize SVRG, called AsySVRG. A lock-free parallel strategy for both read and write Linear convergence rate with theoretical proof Outperforms Hogwild! in experiments AsySVRG is the first lock-free parallel SGD method with theoretical proof of convergence. Wu-Jun Li ( PDSL CS, NJU 9 / 36

10 AsySVRG AsySVRG: a multi-thread version of SVRG Initialization: p threads, initialize w 0, η; for t = 0, 1, 2,... do u 0 = w t ; All threads parallelly compute the full gradient f(u 0 ) = 1 n n i=1 f i(u 0 ); u = w t ; For each thread, do: for m = 1 to M do Read current value of u, denoted as û, from the shared memory. And randomly pick up an i from {1,..., n}; Compute the update vector: ˆv = f i (û) f i (u 0 ) + f(u 0 ); u u ηˆv; end for Take w t+1 to be the current value of u in the shared memory; end for Wu-Jun Li ( PDSL CS, NJU 10 / 36

11 Lock-free Analysis AsySVRG In all the GD or SGD methods to solve the objective function, the key step can be written as Notation u u + i,j : the j th update vector computed by the i th thread; u R d and u = (u (1), u (2),..., u (d) ); t (k) i,j : the time when the operation u(k) u (k) + (k) i,j has been completed (Not the time when the operation begins); Assuming i, j, t (1) i,j < t(2) i,j <... < t(d) i,j, which can be easily guaranteed by programming Wu-Jun Li ( PDSL CS, NJU 11 / 36

12 AsySVRG Lock-free Analysis: update sequence Since u (1) can only be changed by at most one thread at any absolute time, these t (1) i,j are different from each other. So we can: Sort these t (1) i,j as t(1) 0 < t (1) 1 <... < t (1) M 1 ( M = p M); 0, 1,..., M 1 are the corresponding update vectors. Since it is lock-free, for each update vector m, the real update vector is B m m because of over-written. The B m is a diagonal matrix whose diagonal elements are 0 or 1. After all the inner-loop stop, we can get: w t+1 = u 0 + M 1 m=0 B m m (1) Wu-Jun Li ( PDSL CS, NJU 12 / 36

13 AsySVRG Lock-free Analysis: update sequence According to (1), we define a sequence {u m } as follows: which means u m+1 = u m + B m m. m 1 u m = u 0 + B i i (2) Note The sequence {u m } (m = 1, 2,..., M 1) is synthetic, and the whole u m may never occur in the shared memory. What we can get is only the final value of u M. i=0 Wu-Jun Li ( PDSL CS, NJU 13 / 36

14 AsySVRG Lock-free Analysis: read sequence Assume the old update vectors 0, 1,..., a(m) 1 have been completely applied to u when one thread is reading the shared variable. At the same time, some new update vectors might be updating u. So we can write û m read by the thread to compute m as follows: û m = u a(m) + b(m) i=a(m) P m,i a(m) i where P m,i a(m) is a diagonal matrix whose diagonal elements are 0 or 1. According to the principle of the order, i (i m) should not be read by û m. So b(m) < m, which means: û m = u a(m) + m 1 i=a(m) P m,i a(m) i Wu-Jun Li ( PDSL CS, NJU 14 / 36

15 AsySVRG Convergence Result for Strongly Convex Problems With some assumptions, our algorithm gets a linear convergence rate for strongly convex problems: Ef(w t+1 ) f(w ) (c M 1 + c 2 1 c 1 )(Ef(w t ) f(w )), where c 1 = 1 αηµ + c 2 and c 2 = η 2 ( 8τL3 ηρ 2 (ρ τ 1) ρ 1 + 2L 2 ρ), M = p M is the total number of iterations of the inner-loop. Note Since it is lock-free, we do not know the exact B m and we can not take the average sum of B m u m to be w t+1. Wu-Jun Li ( PDSL CS, NJU 15 / 36

16 AsySVRG Convergence Result for Non-Convex Problems With some assumptions, our algorithm gets a sub-linear convergence rate for non-convex problems: 1 T M T 1 M 1 t=0 m=0 E f(u t,m ) 2 Ef(w 0) Ef(w T ) T Mγ. Similar to the analysis for strongly convex problems, we construct an equivalent write sequence {u t,m } for the t th outer-loop: u t,0 = w t, u t,m+1 = u t,m ηb t,mˆv t,m, where ˆv t,m = f it,m (û t,m ) f it,m (u t,0 ) + f(u t,0 ). B t,m is a diagonal matrix whose diagonal entries are 0 or 1. And û t,m is read by the thread who computes ˆv t,m. Wu-Jun Li ( PDSL CS, NJU 16 / 36

17 Experiments AsySVRG Experimental platform: A server with 12 Intel cores and 64G memory. Model: Logistic regression with L2-norm f(w) = 1 n n log(1 + e y ix T i w ) + λ 2 w 2 i=1 Data set We set λ = dataset instances features memory type rcv1 20,242 47,236 36M sparse real-sim 72,309 20,958 90M sparse news20 19,996 1,355, M sparse epsilon 400,000 2,000 11G dense Wu-Jun Li ( PDSL CS, NJU 17 / 36

18 AsySVRG Experiments: Computation Cost log(objective value min) log(objective value min) 10 0 rcv AsySVRG 1 AsySVRG lock 10 AsySVRG Hogwild lock 10 Hogwild number of effective passes log(objective value min) 10 0 real sim AsySVRG 1 AsySVRG lock AsySVRG 10 Hogwild lock 10 Hogwild number of effective passes (a) rcv1 (b) realsim 10 0 news AsySVRG 1 AsySVRG lock AsySVRG 10 Hogwild lock 10 Hogwild number of effective passes log(objective value min) 10 0 epsilon AsySVRG 1 AsySVRG lock AsySVRG 10 Hogwild lock 10 Hogwild number of effective passes (c) news20 (d) epsilon Wu-Jun Li ( PDSL CS, NJU 18 / 36

19 AsySVRG Experiments: Total Time Cost log(objective value min) log(objective value min) rcv1 real sim 10 0 AsySVRG 10 Hogwild time log(objective value min) 10 0 AsySVRG 10 Hogwild time (a) rcv1 (b) realsim news20 epsilon 10 0 AsySVRG 10 Hogwild time log(objective value min) 10 0 AsySVRG 10 Hogwild time (c) news20 (d) epsilon Wu-Jun Li ( PDSL CS, NJU 19 / 36

20 Experiments: Speed up 6 rcv1 5 AsySVRG 6 real sim AsySVRG lock AsySVRG AsySVRG lock AsySVRG 5 speedup 4 3 speedup number of threads number of threads (a) rcv1 (b) realsim 8 news20 6 epsilon 7 AsySVRG lock AsySVRG 5 AsySVRG lock AsySVRG unlock 6 speedup 5 4 speedup number of threads (c) news number of threads (d) epsilon Wu-Jun Li ( PDSL CS, NJU 20 / 36

21 Outline SCOPE 1 Introduction 2 AsySVRG 3 SCOPE 4 Conclusion Wu-Jun Li ( PDSL CS, NJU 21 / 36

22 SCOPE Motivation and Contribution Motivation: Bulk synchronous parallel (BSP) models, such as MapReduce, are commonly considered to be inefficient for distributed stochastic learning. Is there any technique to solve the issues of BSP models? Contribution: A novel distributed stochastic learning method, called scalable composite optimization for learning (SCOPE), on BSP models Both computation-efficient and communication-efficient Linear convergence rate with theoretical proof Can be easily integrated into the data processing pipeline of Spark Outperform other state-of-the-art distributed learning methods on Spark Wu-Jun Li ( PDSL CS, NJU 22 / 36

23 Framework of SCOPE SCOPE Master w Worker_1 Worker_ Worker_p D! D! D! Figure: Distributed framework of SCOPE. Wu-Jun Li ( PDSL CS, NJU 23 / 36

24 SCOPE Optimization Algorithm: Master Task of Master in SCOPE: Initialization: p Workers, w 0 ; for t = 0, 1, 2,..., T do Send w t to the Workers; Wait until it receives z 1, z 2,..., z p from the p Workers; Compute the full gradient z = 1 n p k=1 z k, and then send z to each Worker; Wait until it receives ũ 1, ũ 2,..., ũ p from the p Workers; Compute w t+1 = 1 p p k=1 ũk; end for Wu-Jun Li ( PDSL CS, NJU 24 / 36

25 SCOPE Optimization Algorithm: Workers Task of Workers in SCOPE: Initialization: initialize η and c > 0; For the Worker k: for t = 0, 1, 2,..., T do Wait until it gets the newest parameter w t from the Master; Let u k,0 = w t, compute the local gradient sum z k = i D k f i (w t ), and then send z k to the Master; Wait until it gets the full gradient z from the Master; for m = 0 to M 1 do Randomly pick up an instance with index i k,m from D k ; u k,m+1 = u k,m η( f ik,m (u k,m ) f ik,m (w t )+z+c(u k,m w t )); end for Send u k,m or 1 M M m=1 u k,m, which is called the locally updated parameter and denoted as ũ k, to the Master; end for Wu-Jun Li ( PDSL CS, NJU 25 / 36

26 Convergence SCOPE Let α = 1 η(2µ + c) < 1, β = cη + 3L 2 η 2 and α + β < 1. We have the following theorems: Theorem If we take w t+1 = 1 p p k=1 u k,m, then we can get the following convergence result: Theorem E w t+1 w 2 (α M + β 1 α )E w t w 2. If we take w t+1 = 1 p p k=1 ũk with ũ k = 1 M M m=1 u k,m, then we can get the following convergence result: E w t+1 w 2 1 ( M(1 α) + β 1 α )E w t w 2. Wu-Jun Li ( PDSL CS, NJU 26 / 36

27 Communication Cost SCOPE Traditional mini-batch based methods: O(T n) SCOPE: O(T ) Wu-Jun Li ( PDSL CS, NJU 27 / 36

28 SCOPE Experiment Logistic regression (LR) with a L 2 -norm regularization term: f(w) = 1 ] n n i=1 [log(1 + e y ix T i w ) + λ 2 w 2. Table: Datasets for evaluation. instances features memory λ MNIST-8M 8,100, G 1e-4 epsilon 400,000 2,000 11G 1e-4 KDD12 73,209,277 1,427,495 21G 1e-4 Data-A 106,691, G 1e-6 Two Spark clusters with Intel CPUs: small: 1 Master and 16 Workers large: 1 Master and 128 Workers Wu-Jun Li ( PDSL CS, NJU 28 / 36

29 SCOPE Experiment Baselines: MLlib [Meng et al., 2015]: MLlib is an open source library for distributed machine learning on Spark. We compare our method with distributed lbfgs for MLlib, which is a batch learning method and faster than the SGD version of MLlib. LibLinear [Lin et al., 2014a]: LibLinear is a distributed Newton method, which is also a batch learning method. Splash [Zhang and Jordan, 2015]: Splash is a distributed SGD method by using the local learning strategy to reduce communication cost. CoCoA [Jaggi et al., 2014]: CoCoA is a distributed dual coordinate ascent method. CoCoA+ [Ma et al., 2015]: CoCoA+ is an improved version of CoCoA. CoCoA+ adopts adding rather than average to combine local updates. Wu-Jun Li ( PDSL CS, NJU 29 / 36

30 Experiment SCOPE 10 0 MNIST 8M with 16 cores 10 0 epsilon with 16 cores objective value optimal SCOPE LibLinear CoCoA MLlib(lbfgs) Splash CoCoA CPU Time(millisecond) x 10 4 objective value optimal SCOPE LibLinear CoCoA MLlib(lbfgs) Splash CoCoA CPU Time(millisecond) x 10 4 (a) MNIST-8M (b) epsilon 10 0 KDD12 with 16 cores 10 0 Data A with 128 cores objective value optimal SCOPE Liblinear CoCoA MLlib(lbfgs) Splash CoCoA+ objective value optimal SCOPE LibLinear CoCoA MLlib(lbfgs) Splash CoCoA CPU Time(millisecond) x 10 5 (c) KDD CPU Time(millisecond) x 10 4 (d) Data-A Wu-Jun Li ( PDSL CS, NJU 30 / 36

31 Experiment SCOPE speedup SCOPE Ideal CPU Time(milllisecond) per iteration computation time waiting time communication time #cores #cores (e) Speedup (f) Synchronization cost Wu-Jun Li ( PDSL CS, NJU 31 / 36

32 Outline Conclusion 1 Introduction 2 AsySVRG 3 SCOPE 4 Conclusion Wu-Jun Li ( PDSL CS, NJU 32 / 36

33 Conclusion Conclusion Stochastic learning is becoming popular for big data machine learning. Lock-free strategy is the key to get a good speedup in parallel stochastic learning. With properly designed techniques, BSP models are also efficient for distributed stochastic learning. Wu-Jun Li ( PDSL CS, NJU 33 / 36

34 Future Work Conclusion Open source project: LIBBLE: A library for big learning LIBBLE-Spark: Classification: LR, SVM, LR with L1-norm Regularization Regression: Linear Regression, Lasso Generalized Linear Models: with L2-norm/L1-norm Regularization Dimensionality Reduction: PCA, SVD Matrix Factorization Clustering: K-Means LIBBLE-MPI LIBBLE-GPU LIBBLE-MultiThread Wu-Jun Li ( PDSL CS, NJU 34 / 36

35 References Conclusion Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, Wu-Jun Li. SCOPE: Scalable Composite Optimization for Learning on Spark. To Appear in Proceedings of the 31th AAAI Conference on Artificial Intelligence (AAAI), Shen-Yi Zhao, Gong-Duo Zhang, Wu-Jun Li. Lock-Free Optimization for Non-Convex Problems. To Appear in Proceedings of the 31th AAAI Conference on Artificial Intelligence (AAAI), Shen-Yi Zhao, Wu-Jun Li. Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee. Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), Wu-Jun Li ( PDSL CS, NJU 35 / 36

36 Q & A Conclusion Thanks! Wu-Jun Li ( PDSL CS, NJU 36 / 36

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and