Learning in a Distributed and Heterogeneous Environment

1 Learning in a Distributed and Heterogeneous Environment. Martin Jaggi, EPFL Machine Learning and Optimization Laboratory, mlo.epfl.ch. Inria - EPFL joint Workshop, Paris, Feb 15th

2 Machine Learning Methods to Analyze Large-Scale Data (diagram: Machine Learning, Systems, Optimization, Applications)

3 Machine Learning Systems machine

4 Photo: Florian Hirzinger

5 Machine Learning Systems (diagram: machine 1, machine 2, machine 3; GPU 1a, GPU 2a; FPGA 1b, FPGA 2b)

6 Challenge: The Cost of Communication. For $v \in \mathbb{R}^{100}$: reading v from memory (RAM): 100 ns; sending v to another machine: ns; typical Map-Reduce iteration: ns

7 Challenge: The Cost of Communication. [Figure: suboptimality vs. time [s] for (A) Spark, (B) Spark+C, (C) pySpark, (D) pySpark+C, (E) MPI.] Source: High-Performance Distributed Machine Learning using Apache Spark, Dünner et al. 2016, arxiv.org/abs/

8 Challenge: Usability. Good distributed and parallel code is hard: no reusability of good single-machine algorithms & code; no portability (model-specific and system-specific code)

9 1. Distributed: What if the data does not fit onto one device anymore? (diagram: machine 1 to machine 5)

10 Problem class: $\min_{\alpha \in \mathbb{R}^n} f(A\alpha) + g(\alpha)$

11 One-Shot Averaging Does Not Work. Each machine $k$ (machine 1 to machine 5) computes a local model $w^{(k)}$ on its own data; a single Reduce step then averages them: $w := \frac{1}{K} \sum_k w^{(k)}$
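
A toy illustration of why a single averaging step falls short, under assumptions that are not on the slide: plain least squares as the model, synthetic Gaussian data, and K = 5 simulated machines that each hold an equal slice of the rows. The helper names (`least_squares`, `one_shot_average`, `loss`) are hypothetical. The sketch only shows that the average of the local solutions is generally not the global minimizer; it does not reproduce any experiment from the talk.

```python
import numpy as np

def least_squares(X, y):
    # Minimizer of 0.5 * ||X w - y||^2 on the given rows.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def one_shot_average(X, y, K):
    # Each simulated machine solves its local problem once, then one Reduce averages.
    X_parts = np.array_split(X, K)
    y_parts = np.array_split(y, K)
    local = [least_squares(Xk, yk) for Xk, yk in zip(X_parts, y_parts)]
    return np.mean(local, axis=0)

def loss(X, y, w):
    # Global least-squares objective over all rows.
    return 0.5 * np.sum((X @ w - y) ** 2)

rng = np.random.default_rng(0)
n, d, K = 200, 5, 5                      # rows, features, machines (illustrative sizes)
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

w_avg = one_shot_average(X, y, K)        # one-shot averaged model
w_opt = least_squares(X, y)              # minimizer on the full data
excess = loss(X, y, w_avg) - loss(X, y, w_opt)
print(f"excess loss of the averaged model: {excess:.4f}")
```

The excess loss is non-negative by construction, since `w_opt` minimizes the global objective, and it is strictly positive whenever the local solutions disagree.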

12 Optimization: Primal-Dual Context

The framework provides an abstraction that allows either a primal or a dual variant to be run. To solve the input problem (I), it is mapped to one of the following two general problems:

(A) $\min_{\alpha \in \mathbb{R}^n} O_A(\alpha) := f(A\alpha) + g(\alpha)$

(B) $\min_{w \in \mathbb{R}^d} O_B(w) := g^*(-A^\top w) + f^*(w)$

Here $\alpha \in \mathbb{R}^n$ and $w \in \mathbb{R}^d$ are parameter vectors, $A := [x_1; \ldots; x_n] \in \mathbb{R}^{d \times n}$ is a data matrix, and the two problems are linked by the correspondence $w := \nabla f(A\alpha)$.

Slide annotations: Lasso (primal); L2-reg SVM/Log-Regr (dual); L1-reg SVM/Log-Reg (primal).
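
As a worked instance of formulation (A), assume the usual least-squares form of the Lasso with a label vector $b \in \mathbb{R}^d$ and a regularization weight $\lambda > 0$. Both symbols are introduced here for illustration and do not appear on the slide. Under that assumption the mapping reads:

```latex
\begin{aligned}
f(v) &= \tfrac{1}{2}\,\|v - b\|_2^2, \qquad g(\alpha) = \lambda\,\|\alpha\|_1, \\
O_A(\alpha) &= \tfrac{1}{2}\,\|A\alpha - b\|_2^2 + \lambda\,\|\alpha\|_1, \\
w &= \nabla f(A\alpha) = A\alpha - b .
\end{aligned}
```

With this choice the shared vector $w$ is simply the current residual. Its dimension is $d$, independent of how the $n$ coordinates of $\alpha$ are split across machines.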

13 CoCoA - Communication Efficient Distributed Optimization. Repeat: each machine $k$ (machine 1 to machine M) computes a local update $\Delta w^{(k)}$ on its own data; a Reduce step then applies the averaged update: $w := w + \frac{1}{K} \sum_k \Delta w^{(k)}$
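
The round structure on this slide (local work on each machine, then one averaged update of the shared vector w) can be sketched in a single process. Everything specific in the sketch is an assumption rather than the authors' implementation: ridge regression is used as the instance of the problem class, with $f(v) = \tfrac{1}{2}\|v - b\|^2$ and $g(\alpha) = \tfrac{\lambda}{2}\|\alpha\|^2$; the local solver is a few passes of cyclic coordinate descent; the K machines are simulated as column blocks of A. The names `local_update`, `cocoa_ridge` and the parameter `passes` are hypothetical.

```python
import numpy as np

def objective(A, b, alpha, lam):
    # O_A(alpha) = 0.5 * ||A alpha - b||^2 + 0.5 * lam * ||alpha||^2
    return 0.5 * np.sum((A @ alpha - b) ** 2) + 0.5 * lam * np.sum(alpha ** 2)

def local_update(A_k, alpha_k, w, lam, passes):
    # Work of one simulated machine: coordinate descent on its own columns,
    # using a private copy of the shared vector w. No communication happens here.
    d_alpha = np.zeros_like(alpha_k)
    w_loc = w.copy()
    for _ in range(passes):
        for i in range(A_k.shape[1]):
            x = A_k[:, i]
            a_i = alpha_k[i] + d_alpha[i]
            # Exact minimization over coordinate i with the others held fixed.
            delta = -(x @ w_loc + lam * a_i) / (x @ x + lam)
            d_alpha[i] += delta
            w_loc += delta * x
    return d_alpha, w_loc - w            # local change in alpha and in w

def cocoa_ridge(A, b, lam, K, rounds, passes=3):
    n = A.shape[1]
    blocks = np.array_split(np.arange(n), K)   # columns owned by each machine
    alpha = np.zeros(n)
    w = A @ alpha - b                          # w = grad f(A alpha)
    history = [objective(A, b, alpha, lam)]
    for _ in range(rounds):
        updates = [local_update(A[:, idx], alpha[idx], w, lam, passes)
                   for idx in blocks]
        # Reduce step: apply the averaged local updates.
        for idx, (d_alpha, _) in zip(blocks, updates):
            alpha[idx] += d_alpha / K
        w = w + sum(d_w for _, d_w in updates) / K
        history.append(objective(A, b, alpha, lam))
    return alpha, w, history

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                 # d = 100, n = 20 (illustrative sizes)
b = A @ rng.normal(size=20) + 0.1 * rng.normal(size=100)
lam = 1.0
alpha, w, history = cocoa_ridge(A, b, lam, K=4, rounds=100)
alpha_star = np.linalg.solve(A.T @ A + lam * np.eye(20), A.T @ b)
print(history[0], history[-1], objective(A, b, alpha_star, lam))
```

Only the vector w, of length d, crosses machine boundaries in each round, while the amount of local work per round is controlled by `passes`. Because the averaged iterate is a convex combination of the K block-updated points, the objective in this sketch cannot increase from one round to the next.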

14 Distributed Experiments: L1-Regularized Linear Regression

[Figure: Webspam - Lasso: Suboptimality vs. Time. x-axis: Seconds; y-axis: Primal Suboptimality, $O_A(\alpha) - O_A(\alpha^*)$. Methods: CoCoA-Primal, Shotgun, Mb-CD, Mb-SGD, Prox-GD, OWL-QN, ADMM.]

Slide annotations: part of TensorFlow core (L2); custom code (L1), TF, Spark, C. NIPS 2014, ICML 2015, arxiv.org/abs/

Table 1. Datasets for Empirical Study

Dataset   Training   Features     Sparsity
url       2,396,     3,231,961    e-3%
epsilon   ,000       2,000        100%
kddb      19,264,    29,890,095   9.8e-5%
webspam   ,000       16,609,      0.02%

Figure 1. Suboptimality in terms of $D(\alpha)$ for solving Lasso reg... $\lambda$=1E-5), and webspam (K=16, $\lambda$=1E-5) datasets. ProxCoCoA+ ... mini-batch SGD, Shotgun, and OWL-QN in terms of the time in ...

Paper excerpts shown on the slide:

Experiments are run on Amazon EC2 m3.xlarge machines with one core per machine for the datasets in Table 1. For Shotgun, Mb-CD, and ProxCoCoA+ in the primal, datasets are distributed by feature, whereas for Mb-SGD, OWL-QN, and ADMM they are distributed by datapoint.

... was prohibitively slow, and we thus use iterations of conjugate gradient and improve performance by allowing early stopping, as well as using a varying penalty parameter, practices described in (Boyd et al., 2010, 4.3, 3.4.1). For mini-batch SGD (Mb-SGD), we tune the step size and mini-batch size parameters. For mini-batch CD (Mb-CD) we scale the updates at each round by $\beta/b$ for mini-batch size $b$ and $\beta \in [1, b]$, and tune both parameters $b$ and $\beta$. Further implementation details are given in the Appendix (Sec C).

In contrast to these described methods, we note that ProxCoCoA+ comes with the benefit of having only a single parameter to tune: the number of local subproblem iterations, H. We further explore the effect of this parameter in Figure 3, and provide a general guideline for choosing it in practice (see Remark 1).

15 Summary: adaptivity to the communication cost; re-usability of good existing solvers; accuracy certificates. Next Steps: second-order and trust-region version (local Hessian); adaptivity to the degree of separability (coming soon); generalization to deep learning, SGD. Benchmarking & code

16 2. Leveraging Heterogeneous Systems. Compute & Memory Hierarchy: Which data to put in which device? (diagram: machine 1, GPU 1a; machine 2, FPGA 1b)

17 Leveraging Heterogeneous Systems: duality gap as selection criterion; adaptive importance sampling. Unit A: 30GB; Unit B: 8GB. AISTATS 2017, 2018; NIPS 2017a,b
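
A minimal sketch of using a duality gap as the selection criterion, under assumptions that go beyond the slide. Ridge regression is used instead of the Lasso and SVM models of the talk, because its per-coordinate gap has a simple closed form: with $f(v) = \tfrac{1}{2}\|v - b\|^2$, $g_i(\alpha_i) = \tfrac{\lambda}{2}\alpha_i^2$ and $w = A\alpha - b$, the gap of coordinate $i$ is $(\lambda\alpha_i + x_i^\top w)^2 / (2\lambda)$, and these terms sum to the total duality gap. The function names and the `capacity` parameter are hypothetical; here capacity counts coordinates, as a loose stand-in for the memory of the smaller unit.

```python
import numpy as np

def coordinate_gaps(A, b, alpha, lam):
    # Per-coordinate duality gap for ridge regression (always >= 0).
    w = A @ alpha - b            # w = grad f(A alpha)
    u = A.T @ w                  # u_i = x_i^T w
    return (lam * alpha + u) ** 2 / (2.0 * lam)

def select_for_small_unit(gaps, capacity):
    # Keep the coordinates with the largest gaps, i.e. those furthest from optimal.
    order = np.argsort(gaps)[::-1]
    return np.sort(order[:capacity])

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 30))    # 30 coordinates in total (illustrative sizes)
b = rng.normal(size=50)
lam = 0.5
alpha = 0.1 * rng.normal(size=30)

gaps = coordinate_gaps(A, b, alpha, lam)
chosen = select_for_small_unit(gaps, capacity=8)
print("coordinates placed on the smaller unit:", chosen)
```

Coordinates whose gap is already zero are optimal given the current w, so spending the limited memory of the faster unit on them would bring no progress. In an iterative scheme the gaps would be recomputed as alpha changes, and the selection refreshed accordingly.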

18 Experiments: RAM, GPU; 30GB dataset. Panels: Lasso, SVM

19 Open Research: limited precision operations for efficiency of communication and computation; asynchronous and fault tolerant algorithms; multi-level approach on heterogeneous systems; more re-usable algorithmic building blocks, for more systems and problems (diagram: machine 1, GPU 1a, FPGA 1b)

20 Thanks! mlo.epfl.ch Celestine Dünner, Virginia Smith, Simone Forte, Thomas Parnell, Chenxin Ma, Martin Takac, Dmytro Perekrestenko, Volkan Cevher, Michael I. Jordan, Thomas Hofmann, Sebastian Stich, Anant Raj

arxiv: v2 [cs.lg] 10 Oct 2018

Journal of Machine Learning Research 19 (2018) 1-49. Submitted 10/16; Published 7/18. CoCoA: A General Framework for Communication-Efficient Distributed Optimization. arXiv:1611.02189v2 [cs.LG] 10 Oct 2018. Virginia Smith