Fast, Cheap and Deep Scaling machine learning



1 Fast, Cheap and Deep Scaling machine learning SFW Alexander Smola CMU Machine Learning and github.com/dmlc

2 Many thanks to
  - Mu Li, Dave Andersen, Chris Dyer, Li Zhou, Ziqi Liu, Manzil Zaheer, Qicong Chen, Amr Ahmed (Google), Yu-Xiang Wang, Jay Yoon Lee, Ha Loc Do (SMU)
  - CXXNET Team: Tianqi Chen (UW), Bing Xu, Naiyang Wang
  - Minerva Team: Minjie Wang, Tianjun Xiao, Jianpeng Li, Jiaxing Zhang

3 This talk in 3 slides

4 Parameter Server [Diagram, repeated three times side by side. Labels: Server; Data (local or cloud); read; local state (or copy); write; update.]

5 Multicore [Diagram. Labels: Parameter Server; Data (local or cloud); read; local state (or copy); write; update.]

6 GPUs (for Deep Learning) [Diagram. Labels: Parameter Server; Data (local or cloud); read; local state (or copy); write; update.]

7 Details
  - Parameter Server Basics: Logistic Regression (Classification)
  - Large Distributed State: Factorization Machines (CTR)
  - Memory Subsystem: Matrix Factorization (Recommender)
  - GPUs: Deep Learning (Images)

8 Estimate Click Through Rate: $p(\underbrace{\text{click}}_{=:y} \mid \underbrace{\text{ad}, \text{query}}_{=:x}, w)$

9 Click Through Rate (CTR)
  - Linear function class: $f(x) = \langle w, x \rangle$
  - Logistic regression: $p(y \mid x, w) = \frac{1}{1 + \exp(-y \langle w, x \rangle)}$
  - Optimization problem: $\min_w \sum_{i=1}^{m} \log(1 + \exp(-y_i \langle w, x_i \rangle)) + \lambda \|w\|_1$
  - Solve distributed over many machines (typically 1TB to 1PB of data)
  - Sparse models for advertising

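10 Optimization Algorithm
  - Compute gradient on data
  - The l1 norm is nonsmooth, hence use the proximal operator (step size $\eta$, l1 weight $\lambda$): $\operatorname{argmin}_w \lambda \|w\|_1 + \frac{1}{2\eta} \|w - (w_t - \eta g_t)\|_2^2$
  - Updates for l1 are very simple: $w_i \leftarrow \operatorname{sgn}(w_i) \max(0, |w_i| - \eta\lambda)$, where $w_i$ on the right is coordinate $i$ of $w_t - \eta g_t$
  - All steps decompose by coordinates
  - Solve in parallel (and asynchronously)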
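The proximal gradient step on slide 10 can be sketched in a few lines. This is a minimal illustration, not the implementation from the talk: the names `soft_threshold`, `logistic_gradient` and `proximal_step`, the step size `eta`, and the l1 weight `lam` are assumptions made for the example, and features are stored in plain dictionaries to mimic sparse data.

```python
import math

def soft_threshold(z, tau):
    """Proximal operator of tau * |.|: shrink z toward zero by tau."""
    return math.copysign(max(0.0, abs(z) - tau), z)

def logistic_gradient(w, data):
    """Gradient of sum_i log(1 + exp(-y_i <w, x_i>)) for sparse examples.

    `data` is a list of (x, y) pairs; x is a dict {coordinate: value}
    and y is +1 or -1.
    """
    grad = {}
    for x, y in data:
        margin = y * sum(w.get(k, 0.0) * v for k, v in x.items())
        # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin));
        # the margin is clamped so math.exp cannot overflow.
        coeff = -y / (1.0 + math.exp(min(margin, 700.0)))
        for k, v in x.items():
            grad[k] = grad.get(k, 0.0) + coeff * v
    return grad

def proximal_step(w, grad, eta, lam):
    """One proximal gradient update, applied independently per coordinate."""
    new_w = {}
    for k in set(w) | set(grad):
        z = w.get(k, 0.0) - eta * grad.get(k, 0.0)  # plain gradient step
        shrunk = soft_threshold(z, eta * lam)       # l1 shrinkage
        if shrunk != 0.0:                           # drop zeros to stay sparse
            new_w[k] = shrunk
    return new_w
```

Because every coordinate is updated on its own, the loop in `proximal_step` could be split across servers by key, which is the property the slide relies on.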

11 Parameter Server Template
  - Smola & Narayanamurthy, 2010, VLDB; Gonzalez et al., 2012, WSDM; Dean et al., 2012, NIPS; Shervashidze et al., 2013, WWW
  - Google, Baidu, Facebook, Amazon, Yahoo, Microsoft
  - Compute gradient on (subset of data) on each client
  - Send gradient from client to server asynchronously: push(key_list,value_list,timestamp)
  - Proximal gradient update on server per coordinate
  - Server returns parameters: pull(key_list,value_list,timestamp)
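To make the push/pull template concrete, here is a toy, single-process stand-in for one server. It only mirrors the call shapes `push(key_list, value_list, timestamp)` and `pull(key_list, value_list, timestamp)` from the slide; the class name `ToyParameterServer`, the constant step size `eta`, the l1 weight `lam`, and the way the timestamp is tracked are all assumptions for illustration, not the API of any real parameter server library.

```python
import math

class ToyParameterServer:
    """In-process stand-in for a server shard (hypothetical, for illustration)."""

    def __init__(self, eta, lam):
        self.eta = eta       # step size, assumed constant here
        self.lam = lam       # l1 regularization weight
        self.weights = {}    # sparse state: key -> value
        self.clock = 0       # largest timestamp seen so far

    def push(self, key_list, value_list, timestamp):
        """Receive gradient values and apply the proximal update per key."""
        self.clock = max(self.clock, timestamp)
        for key, g in zip(key_list, value_list):
            z = self.weights.get(key, 0.0) - self.eta * g
            w = math.copysign(max(0.0, abs(z) - self.eta * self.lam), z)
            if w == 0.0:
                self.weights.pop(key, None)   # keep the state sparse
            else:
                self.weights[key] = w

    def pull(self, key_list, value_list, timestamp):
        """Fill value_list with current parameters (0.0 for absent keys)."""
        self.clock = max(self.clock, timestamp)
        value_list[:] = [self.weights.get(key, 0.0) for key in key_list]
```

A client would compute gradients on its share of the data, call `push` with the touched keys, and later call `pull` for the keys of its next minibatch. A real deployment makes both calls asynchronous and shards keys over many servers; neither is modeled here.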

12 Solving it at scale
  - 2014 (Li et al., OSDI): […] TB data, […] variables; local file system stores files; 1000 servers (corp cloud), 1h time, 140 MB/s learning
  - [Plot: Online solver, System A, System B, Parameter Server; axis: time (hour)]
  - 1.1 TB (Criteo), […] variables, […] samples; S3 stores files (no preprocessing), better IO library; 5 machines (c4.8xlarge), 1000s time, 220 MB/s learning

13 Details
  - Parameter Server Basics: Logistic Regression (Classification)
  - Large Distributed State: Factorization Machines (CTR)
  - Memory Subsystem: Matrix Factorization (Recommender)
  - GPUs: Deep Learning (Images)

14 A Linear Model is not enough: $p(y \mid x, w)$

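15 Factorization Machines
  - Linear Model: $f(x) = \langle w, x \rangle$
  - Polynomial Expansion (Rendle, 2012): $f(x) = \langle w, x \rangle + \sum_{i<j} x_i x_j \operatorname{tr}\left(V^{(2)}_i \otimes V^{(2)}_j\right) + \sum_{i<j<k} x_i x_j x_k \operatorname{tr}\left(V^{(3)}_i \otimes V^{(3)}_j \otimes V^{(3)}_k\right) + \ldots$
  - Annotations: memory hog; too large for individual machine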
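As a small worked example of the polynomial expansion, the sketch below evaluates only the linear and second-order terms. It assumes that for vector-valued embeddings the trace term reduces to the inner product $\langle v_i, v_j \rangle$, and it uses the well-known identity $\sum_{i<j} x_i x_j \langle v_i, v_j \rangle = \frac{1}{2} \sum_f \left[ \left(\sum_i v_{if} x_i\right)^2 - \sum_i v_{if}^2 x_i^2 \right]$ so that the cost is linear in the number of nonzero features. The function name `fm_score` and the dictionary layout are assumptions for illustration.

```python
def fm_score(x, w, V):
    """Second-order factorization machine score for a sparse input.

    x: dict {feature: value}
    w: dict {feature: linear weight}
    V: dict {feature: list of k floats}, the embedding vectors
    """
    linear = sum(w.get(i, 0.0) * xi for i, xi in x.items())
    k = len(next(iter(V.values()))) if V else 0
    pairwise = 0.0
    for f in range(k):
        s = 0.0      # sum_i V[i][f] * x_i
        s_sq = 0.0   # sum_i (V[i][f] * x_i)^2
        for i, xi in x.items():
            if i in V:
                t = V[i][f] * xi
                s += t
                s_sq += t * t
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise
```

Only the embeddings of features that are present in `x` are touched, which is why fetching just those vectors per minibatch is enough to evaluate the model.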

16 Prefetching to the rescue
  - Most keys are infrequent (power law distribution)
  - Prefetch the embedding vectors for a minibatch from the parameter server [Diagram: timeline t=1, t=2, t=3, ...]
  - Compute gradients and push to server
  - Variable dimensionality embedding
  - Enforcing sparsity (ANOVA style)
  - Adaptive gradient normalization
  - Frequency adaptive regularization (CF style)
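The prefetch step can be pictured as follows: before working on a minibatch, a worker collects the distinct keys that the minibatch touches and requests only those embeddings. This is a sketch under stated assumptions; `minibatch_keys`, `prefetch`, and the `pull` callable are hypothetical names, and `pull` merely stands in for a request to the parameter server.

```python
def minibatch_keys(batch):
    """Unique feature keys touched by a minibatch of sparse examples."""
    keys = set()
    for x, _y in batch:
        keys.update(x)          # x is a dict {feature: value}
    return sorted(keys)

def prefetch(batch, pull):
    """Fetch only the embeddings this minibatch needs.

    `pull` is a hypothetical callable that maps a list of keys to a list
    of embedding vectors in the same order.
    """
    keys = minibatch_keys(batch)
    return dict(zip(keys, pull(keys)))
```

With a power law over keys, the number of distinct keys in a minibatch is far smaller than the full vocabulary, so the local cache returned by `prefetch` stays small even when the full model does not fit on one machine.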

17 Better Models (Criteo 1TB) [Plot: relative logloss (%) vs. k for no mem adaption, frequency, and frequency + l1 shrinkage. Annotation: what everyone else does.]

18 Faster Solver (small Criteo) [Plot: test logloss vs. time (sec) for LibFM; DiFacto, 1 worker; DiFacto, 10 workers.] LibFM died on large models.

19 Multiple Machines (Li, Wang, Liu, Smola, WSDM 16, submitted): […]x LibFM speed on 16 machines. [Plot: speedup (x) vs. # of machines for Criteo2 and CTR.]

20 Details
  - Parameter Server Basics: Logistic Regression (Classification)
  - Large Distributed State: Factorization Machines (CTR)
  - Memory Subsystem: Matrix Factorization (Recommender)
  - GPUs: Deep Learning (Images)

21 Recommender Systems
  - Users u, movies m (or projects)
  - Function class: $r_{um} = \langle v_u, w_m \rangle + b_u + b_m$
  - Loss function for recommendation (Yelp, Netflix): $\sum_{u \sim m} (\langle v_u, w_m \rangle + b_u + b_m - y_{um})^2$

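22 Recommender Systems
  - Regularized objective ($V$ and $W$ stack the vectors $v_u$ and $w_m$): $\sum_{u \sim m} (\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um})^2 + \lambda \left[ \|V\|^2_{\text{Frob}} + \|W\|^2_{\text{Frob}} \right]$
  - Update operations:
    $v_u \leftarrow (1 - \eta_t \lambda) v_u - \eta_t w_m (\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um})$
    $w_m \leftarrow (1 - \eta_t \lambda) w_m - \eta_t v_u (\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um})$
  - Very simple SGD algorithm (random pairs)
  - This should be cheap; the cost lies in the memory subsystem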
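The SGD updates for the two factor vectors can be sketched as below. This is an illustrative reading of the slide, not the authors' code: the constant step size `eta`, the weight `lam`, the function names, and the plain gradient steps on the biases (which the slide does not show) are assumptions, and the factor 2 from differentiating the squared error is taken to be absorbed into the step size.

```python
import math
import random

def predict(V, W, b_user, b_movie, b0, u, m):
    """Model prediction <v_u, w_m> + b_u + b_m + b_0."""
    return sum(a * b for a, b in zip(V[u], W[m])) + b_user[u] + b_movie[m] + b0

def sgd_epoch(ratings, V, W, b_user, b_movie, b0, eta, lam, rng):
    """One pass of SGD over (user, movie, rating) triples in random order."""
    order = list(range(len(ratings)))
    rng.shuffle(order)
    for idx in order:
        u, m, r = ratings[idx]
        v_u, w_m = V[u], W[m]
        err = predict(V, W, b_user, b_movie, b0, u, m) - r
        # Both factor updates use the pre-update copy of the other vector.
        V[u] = [(1 - eta * lam) * a - eta * err * b for a, b in zip(v_u, w_m)]
        W[m] = [(1 - eta * lam) * b - eta * err * a for a, b in zip(v_u, w_m)]
        # Bias updates are not on the slide; plain gradient steps are assumed.
        b_user[u] -= eta * err
        b_movie[m] -= eta * err

def rmse(ratings, V, W, b_user, b_movie, b0):
    """Root mean squared error over the given ratings."""
    total = sum((predict(V, W, b_user, b_movie, b0, u, m) - r) ** 2
                for u, m, r in ratings)
    return math.sqrt(total / len(ratings))
```

Each update reads and writes exactly two length-d vectors, one for the user and one for the movie, so the work per rating is dominated by memory traffic rather than arithmetic.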

23 This should be cheap
  - O(md) burst reads and O(m) random reads
  - Netflix dataset: m = 100 million, d = 2048 dimensions, 30 steps
  - Runtime should be > 4500s: 60 GB/s memory bandwidth = 3300s; 100 ns random reads = 1200s
  - We get 560s. Why?
  - Liu, Wang, Smola, RecSys 2015
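The two estimates on this slide can be reproduced with back-of-the-envelope arithmetic. The slide gives m, d, the number of steps, the bandwidth, and the random-read latency; the remaining constants below (8-byte values, four vector transfers and four random accesses per rating) are assumptions chosen because they reproduce the quoted 3300s and 1200s, and the slide does not state them.

```python
m = 100_000_000        # ratings (from the slide)
d = 2048               # embedding dimensions (from the slide)
steps = 30             # passes over the data (from the slide)
bandwidth = 60e9       # bytes per second (from the slide)
random_read = 100e-9   # seconds per random access (from the slide)

# Assumptions, not stated on the slide:
bytes_per_value = 8              # double precision
vectors_per_rating = 4           # read and write both v_u and w_m
random_accesses_per_rating = 4   # one per vector transfer

burst_seconds = m * d * bytes_per_value * vectors_per_rating * steps / bandwidth
random_seconds = m * random_accesses_per_rating * steps * random_read
total_seconds = burst_seconds + random_seconds

print(round(burst_seconds), round(random_seconds), round(total_seconds))
```

Under these assumptions the streaming term comes to about 3277s and the random-access term to 1200s, close to the 4500s bound on the slide; the measured 560s is therefore only possible if most vector accesses are served from cache.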

24 Power law in Collaborative Filtering [Plot: Netflix dataset; axes: # movies, # ratings.]

25 Key Ideas
  - Stratify ratings by users (only 1 cache miss / read per user / out of core)
  - Keep frequent movies in cache (stratify by blocks of movie popularity)
  - Avoid false sharing between sockets (key cached in the wrong CPU causes miss)
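The first idea, stratifying ratings by user, can be illustrated with a toy counter. The sketch counts how often the active user changes along the rating stream, which is only a rough proxy for cache misses on user vectors since it ignores cache capacity, movie vectors, and sockets. The function names are hypothetical.

```python
def user_vector_loads(ratings):
    """Count how often the active user changes along the rating stream.

    Each change forces a fresh load of v_u, so this is a crude proxy
    for cache misses on user vectors.
    """
    loads = 0
    previous = None
    for user, _movie, _rating in ratings:
        if user != previous:
            loads += 1
            previous = user
    return loads

def stratify_by_user(ratings):
    """Order ratings so that each user's ratings are contiguous."""
    return sorted(ratings, key=lambda triple: triple[0])
```

After stratification the count equals the number of distinct users, matching the "1 cache miss per user" claim, whereas a randomly ordered stream can reload a user vector for almost every rating.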

26 Key Ideas GraphChi Partitioning

27 Key Ideas SC-SGD partitioning

28 Speed [Plots: seconds vs. number of dimensions on c4.8xlarge and g2.8xlarge for C-SGD, Fast SGLD, GraphChi, GraphLab, and BIDMach. Panels: Netflix - 100M, 15 iterations; Yahoo - 250M, 30 iterations.]

29 Convergence [Plot: testing RMSE vs. runtime in sec (30 epochs in total) for C-SGD, GraphChi SGD, and Fast SGLD, each at k=2048.] GraphChi blocks (users, movies) into random groups: poor mixing, slow convergence.

30 Details (github.com/dmlc)
  - Parameter Server Basics: Logistic Regression (Classification)
  - Large Distributed State: Factorization Machines (CTR)
  - Memory Subsystem: Matrix Factorization (Recommender)
  - GPUs: Deep Learning (Images)

31 The Challenge
  - Multiple good single-machine toolkits
    - Caffe: convolution optimized (images)
    - CXXNET: good tensor library
    - Minerva: scheduler & layout on CPU/GPU
    - Torch: Lua + interesting C preprocessor (very very popular, though)
    - Theano: deep network compiler built by ML
  - Don't reinvent the wheel for deep learning
  - Integrate with parameter server

32 Minerva (dmlc/minerva)
  - Tensor interface in Python (similar to numpy)
  - Dataflow engine
  - Auto parallel execution: on multi-core CPU, on multi-GPU
  - Optimizes layout automatically
  - Zhang et al., '14 (NIPS workshop)

33 Minerva Scaling [Plot: images/second for AlexNet, VGGNet, and GoogLeNet with 1 card, […] cards, and […] cards.]

34 Distributed Deep Learning

35 Distributed Deep Learning

36 Scaling on AWS g2.2xlarge 1Gbit network limit (alexnet scaling)

37 Amazon just released g2.8xlarge
  - 12 instances (48 GPUs) at $0.50/h spot
  - Minibatch size 512
  - BSP with 1 delay between machines
  - 2 GB/s bandwidth between machines (awful); measured values are all over the place in the availability zone
  - Compressing to 1 byte per coordinate helps a bit but adds latency due to extra pass (need to fix)
  - 37x speedup on 48 GPUs
  - ImageNet 12 dataset trained in 4h, i.e. $24 (with AlexNet; GoogLeNet even better for network)
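The cost and scaling figures on this slide fit together as simple arithmetic. The sketch assumes the garbled price is $0.50 per instance-hour on the spot market, which is the reading that reproduces the quoted $24; the variable names are illustrative.

```python
instances = 12
gpus = 48
spot_price_per_hour = 0.50   # dollars per instance-hour (assumed reading)
hours = 4
speedup = 37

cost = instances * spot_price_per_hour * hours   # total dollars
gpus_per_instance = gpus // instances
scaling_efficiency = speedup / gpus              # fraction of ideal speedup

print(cost, gpus_per_instance, round(scaling_efficiency, 2))
```

Twelve instances for four hours at that price come to $24, and a 37x speedup on 48 GPUs corresponds to roughly 77% of linear scaling.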

38 Summary
  - Parameter Server Basics: Logistic Regression
  - Large Distributed State: Factorization Machines
  - Memory Subsystem: Matrix Factorization
  - GPUs: Deep Learning
  - Much more: Topic Models, NLP, Docker, Sketches, Fault Tolerance
  - We are hiring!
