T. HOEFLER
Transcription
1 T. Hoefler: Demystifying Parallel and Distributed Deep Learning - An In-Depth Concurrency Analysis. Keynote at the 6th Accelerated Data Analytics and Computing Workshop (ADAC'18)
2 What is Deep Learning good for? Digit recognition, object classification, segmentation, image captioning, gameplay AI, translation, neural computers. A very promising area of research: ~23 papers per day! [Chart: number of deep learning papers per year since 1989]
3 How does Deep Learning work? Deep Learning is Supercomputing! A network computes f(x), mapping an input image to classes (Cat, Dog, Airplane, Horse, Bicycle, Truck), and is trained by layer-wise weight updates. [Charts: accuracy vs. operations of popular models (Canziani et al. 2017); number of users] Datasets: ImageNet (1k): 180 GB, ImageNet (22k): a few TB, industry: much larger. Models: 100-200 layers deep, ~100M-2B parameters, 0.1-8 GiB parameter storage, 10-22k labels and growing (e.g., face recognition), weeks to train.
4 A brief theory of supervised deep learning. Labeled samples x ∈ X are drawn from a distribution D with label domain Y and true label ℓ(x). The network f(x): X → Y has a fixed structure and learned weights w; it is a composition f(x) = f_n(f_{n-1}(... f_2(f_1(x)) ...)) of layers (convolution 1, convolution 2, pooling, convolution 3, fully connected, ...). Training solves
    w* = argmin_{w ∈ R^d} E_{x~D} [ ℓ(w, x) ]
with, for example,
    square loss:        ℓ_sq(w, x) = || f(x) - ℓ(x) ||^2
    0-1 loss:           ℓ_{0/1}(w, x) = 0 if f(x) = ℓ(x), 1 otherwise
    cross-entropy loss: ℓ_ce(w, x) = - Σ_i ℓ(x)_i · log( e^{f(x)_i} / Σ_k e^{f(x)_k} )
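To make the three losses concrete, here is a minimal NumPy sketch (illustrative only; the array names, the 6-class example vectors, and the use of argmax for the 0-1 loss are assumptions, not part of the slides):

    import numpy as np

    def square_loss(fx, lx):
        # l_sq(w, x) = ||f(x) - l(x)||^2
        return np.sum((fx - lx) ** 2)

    def zero_one_loss(fx, lx):
        # l_0/1(w, x) = 0 if the predicted class matches the true class, else 1
        return 0.0 if np.argmax(fx) == np.argmax(lx) else 1.0

    def cross_entropy_loss(fx, lx):
        # l_ce(w, x) = -sum_i l(x)_i * log(softmax(f(x))_i)
        z = np.exp(fx - np.max(fx))          # shift for numerical stability
        softmax = z / np.sum(z)
        return -np.sum(lx * np.log(softmax))

    # one-hot "Cat" label against a 6-class network output
    fx = np.array([2.0, 0.5, 0.1, 0.0, -1.0, 0.3])
    lx = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    print(square_loss(fx, lx), zero_one_loss(fx, lx), cross_entropy_loss(fx, lx))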
5 Stochastic Gradient Descent. SGD approximately solves w* = argmin_{w ∈ R^d} E_{x~D}[ℓ(w, x)] by following stochastic gradients through the layer composition f(x) = f_n(... f_2(f_1(x)) ...) (convolution 1, convolution 2, pooling, convolution 3, fully connected). Per-layer memory footprint:
    layer storage = |w_ℓ| + |f_ℓ(o_{ℓ-1})| + |∇w_ℓ| + |∇o_ℓ|
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
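As a rough illustration of the layer-storage formula, the sketch below counts the bytes for a single layer; the 4096x4096 fully connected example and the 4-byte element size are assumptions:

    def layer_storage(n_weights, n_outputs, bytes_per_elem=4):
        # |w_l| + |f_l(o_{l-1})| + |grad w_l| + |grad o_l|, in bytes,
        # per sample and ignoring the minibatch dimension.
        return bytes_per_elem * (n_weights + n_outputs + n_weights + n_outputs)

    # e.g. a fully connected layer with 4096x4096 weights and 4096 outputs
    print(layer_storage(4096 * 4096, 4096) / 2**20, "MiB")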
6 Trends in deep learning: hardware and multi-node. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory] Deep learning is largely on distributed memory today! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
7 Trends in distributed deep learning: node count and communication. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: node count over time; communication mode] Deep learning research is converging to MPI! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
8 Minibatch Stochastic Gradient Descent (SGD). [Figure: a minibatch of labeled images is pushed through the network, producing per-class scores (Cat, Dog, Airplane, Horse, Bicycle, Truck) that are compared against the true labels to form the weight update.] T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
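A minimal minibatch SGD loop on a toy linear least-squares problem, to illustrate the update pattern the figure describes (the model, learning rate, batch size, and data are made up for illustration, not from the talk):

    import numpy as np

    def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=20, seed=0):
        # Minibatch SGD on the average square loss of a linear model w.
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            order = rng.permutation(len(X))
            for start in range(0, len(X), batch_size):
                idx = order[start:start + batch_size]           # one minibatch
                grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
                w -= lr * grad                                  # weight update
        return w

    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 5))
    w_true = np.arange(1.0, 6.0)
    print(minibatch_sgd(X, X @ w_true))    # approaches w_true = [1, 2, 3, 4, 5]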
9 A primer of relevant parallelism and communication theory. Work-depth model: work W, depth D, average parallelism = W/D (small example: W = 39, D = 7). Parallel reductions for parameter updates, with L = latency, G = per-byte cost, γ bytes per element, m elements, P processes:
    Tree (small vectors):         T = 2L·log2(P) + 2γmG·log2(P)
    Butterfly (small vectors):    T = L·log2(P) + γmG·log2(P)
    Pipeline (large vectors):     T = 2L·(P-1) + 2γmG·(P-1)/P
    RedScat+Gat (large vectors):  T = 2L·log2(P) + 2γmG·(P-1)/P
    Lower bound:                  T ≥ L·log2(P) + 2γmG·(P-1)/P
E. Chan et al.: Collective communication: theory, practice, and experience, CCPE'07; T. Hoefler, D. Moor: Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations, JSFI'14
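These cost models are easy to compare numerically. In the sketch below the machine parameters (latency, per-byte time, element size) and the vector length are placeholder values, not measurements:

    from math import log2

    def allreduce_costs(P, m, L=1e-6, G=1e-9, gamma=1.0):
        # L: latency (s), G: per-byte time (s/byte), m elements of gamma bytes.
        tree      = 2 * L * log2(P) + 2 * gamma * m * G * log2(P)
        butterfly =     L * log2(P) +     gamma * m * G * log2(P)
        pipeline  = 2 * L * (P - 1) + 2 * gamma * m * G * (P - 1) / P
        redscat   = 2 * L * log2(P) + 2 * gamma * m * G * (P - 1) / P
        bound     =     L * log2(P) + 2 * gamma * m * G * (P - 1) / P
        return {"tree": tree, "butterfly": butterfly, "pipeline": pipeline,
                "redscat+gat": redscat, "lower bound": bound}

    # large-vector regime: reduce-scatter + gather comes closest to the bound
    print(allreduce_costs(P=1024, m=10_000_000))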
10 GoogLeNet in more detail: ~6.8M parameters, 22 layers deep. C. Szegedy et al.: Going Deeper with Convolutions, CVPR'15
11 Parallelism in the different layer types. Each layer computes an output o_ℓ = f_ℓ(x, w) (convolutions, pooling, fully connected); within a layer the work W is linear in its size while the depth D is logarithmic, so the average parallelism W/D is large. T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
12 Computing fully connected layers. A fully connected layer computes o_j = σ(Σ_i w_{i,j}·x_i + b_j) for each output j. For a minibatch x_1 ... x_N this becomes a single matrix-matrix multiplication: the N×4 input matrix (each row the inputs x_{n,1}, x_{n,2}, x_{n,3} plus a constant 1) times the 4×2 weight matrix of the w_{i,j} with the bias row (b_1, b_2) appended.
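A small NumPy sketch of this "append a ones column" trick; choosing tanh as the activation σ and the shown shapes are assumptions for illustration:

    import numpy as np

    def fully_connected(X, W, b, act=np.tanh):
        # Batched FC layer: append a constant-1 column to X and a bias row to W,
        # so the whole layer is one matrix-matrix multiplication (a GEMM).
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # N x (d_in + 1)
        Wb = np.vstack([W, b])                            # (d_in + 1) x d_out
        return act(Xb @ Wb)

    X = np.random.randn(4, 3)          # minibatch of N=4 samples, 3 inputs
    W = np.random.randn(3, 2)          # 3 inputs -> 2 outputs
    b = np.zeros((1, 2))
    print(fully_connected(X, W, b).shape)   # (4, 2)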
13 Computing convolutional layers. Main approaches: direct convolution; indirect via im2col followed by a matrix multiplication; FFT-based (transform with F, multiply, transform back with F^{-1}); and Winograd. S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014; X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR'17 Workshop; K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006; M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14; A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16
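A sketch of the indirect (im2col) approach for a single-channel input with stride 1 and no padding; the data layout and the box-filter example are simplifications of the idea, not cuDNN's actual implementation:

    import numpy as np

    def im2col_conv2d(x, w):
        # Indirect convolution: gather k x k patches into columns, then one GEMM.
        H, W_ = x.shape
        k = w.shape[0]
        out_h, out_w = H - k + 1, W_ - k + 1
        cols = np.empty((k * k, out_h * out_w))
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
        return (w.ravel() @ cols).reshape(out_h, out_w)

    x = np.arange(25, dtype=float).reshape(5, 5)
    w = np.ones((3, 3)) / 9.0                      # 3x3 box filter
    print(im2col_conv2d(x, w))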
14 Microbatching (µ-cuDNN). In cuDNN there are ~16 convolution implementations, and performance depends on temporary memory (workspace) size. Key idea: segment the minibatch into microbatches, reuse the workspace, and use different algorithms per microbatch. How to choose microbatch sizes and algorithms? Dynamic programming (space reuse) or integer linear programming (space sharing). The result is fast (up to 4.54x on DeepBench), memory efficient, and utilizes heterogeneous clusters (P100-SXM2). Oyama et al.: µ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching, arXiv 2018
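The dynamic-programming idea can be sketched as follows: segment a minibatch into microbatches whose workspace each fits the (reused) limit, choosing a possibly different algorithm per microbatch. The cost and workspace models below are hypothetical stand-ins for cuDNN measurements, not taken from µ-cuDNN:

    from functools import lru_cache

    def best_split(batch, algs, workspace_limit):
        # algs maps an algorithm name to (time_fn(micro), workspace_fn(micro)).
        @lru_cache(maxsize=None)
        def dp(n):
            # Best (time, plan) for processing n remaining samples.
            if n == 0:
                return 0.0, ()
            best_t, best_plan = float("inf"), ()
            for micro in range(1, n + 1):
                for name, (time_fn, ws_fn) in algs.items():
                    if ws_fn(micro) <= workspace_limit:
                        t_rest, plan = dp(n - micro)
                        t = time_fn(micro) + t_rest
                        if t < best_t:
                            best_t, best_plan = t, ((micro, name),) + plan
            return best_t, best_plan
        return dp(batch)

    # Hypothetical per-microbatch cost/workspace models for two algorithms.
    algs = {"winograd": (lambda m: 1.0 + 0.5 * m, lambda m: 40 * m),
            "gemm":     (lambda m: 2.0 + 0.8 * m, lambda m: 10 * m)}
    print(best_split(batch=64, algs=algs, workspace_limit=2000))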
15 Model parallelism. Parameters can be distributed across processors; the minibatch has to be copied to all processors; backpropagation requires all-to-all communication in every layer. U.A. Müller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int'l Conf. on Neural Networks
16 Pipeline parallelism. Layers/parameters can be distributed across processors; sparse communication pattern (only between pipeline stages); the minibatch has to be copied through all processors. G. Blelloch and C.R. Rosenberg: Network Learning on the Connection Machine, IJCAI'87
17 Data parallelism. A simple and efficient solution, easy to implement; parameters are duplicated at all processors. X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS'89
18 Hybrid parallelism combines data parallelism, model parallelism, and layer (pipeline) parallelism. Layers/parameters can be distributed across processors and the minibatch can be distributed as well; the choice is often specific to the layer type (e.g., distribute fully connected layers but handle convolutional layers data-parallel). This enables arbitrary combinations of data, model, and pipeline parallelism - very powerful! A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014; J. Dean et al.: Large scale distributed deep networks, NIPS'12; T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
19 Updating parameters in distributed data parallelism. Centralized: training agents push gradients to a (sharded) parameter server, which applies w ← u(w, ∇w); with s shards, roughly T = 2L + 2PγmG/s. Decentralized: a collective allreduce of ∇w among all agents, T = 2L·log2(P) + 2γmG·(P-1)/P. Design space: collective operations, topologies, neighborhood collectives, RMA? Hierarchical parameter servers (S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16) and adaptive minibatch sizes (S.L. Smith et al.: Don't Decay the Learning Rate, Increase the Batch Size, arXiv 2017) refine both schemes.
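A sketch of the decentralized variant using mpi4py: each rank computes a local gradient and the ranks average it with one allreduce per iteration. It assumes mpi4py and an MPI launcher are available; the model size and learning rate are placeholders:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P = comm.Get_size()

    def sgd_step(w, local_grad, lr=0.01):
        # Sum the per-rank gradients over all P agents, then apply the update.
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
        return w - lr * global_grad / P

    # toy usage: every rank holds the same w and its own minibatch gradient
    w = np.zeros(10)
    local_grad = np.random.randn(10)
    w = sgd_step(w, local_grad)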
20 Parameter (and model) consistency - centralized. The parameter exchange frequency can be controlled while still attaining convergence. A (sharded) parameter server applies w ← u(w, ∇w); agents can synchronize every step (synchronous), stay within a maximum staleness bound (stale-synchronous / bounded asynchronous), or not synchronize at all (asynchronous). [Figure: per-agent update timelines for the three consistency models.] Started with Hogwild! [Niu et al. 2011] on shared memory, by chance; DistBelief [Dean et al. 2012] moved the idea to distributed memory. Trades off statistical performance for hardware performance. J. Dean et al.: Large scale distributed deep networks, NIPS'12; F. Niu et al.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11
21 Parameter (and model) consistency - decentralized. Again, the parameter exchange frequency can be controlled while still attaining convergence, now with a collective allreduce of ∇w among the training agents: every step (synchronous), within a maximum staleness (stale-synchronous / bounded asynchronous), or asynchronously among merging subgroups of agents. [Figure: per-agent allreduce timelines for the three consistency models.] One may also consider limited/slower distribution - gossip [Jin et al. 2016]. Peter H. Jin et al.: How to scale distributed deep learning?, NIPS MLSystems Workshop 2016
22 Parameter consistency in deep learning: the spectrum ranges from consistent to inconsistent - synchronous SGD, stale-synchronous SGD, asynchronous SGD (HOGWILD!), model averaging (e.g., elastic), ensemble learning. Elastic averaging uses physical forces between different versions of w:
    w_{t+1,i} = w_{t,i} - η·∇w_{t,i} - α·(w_{t,i} - w̄_t)
    w̄_{t+1} = (1 - β)·w̄_t + (β/m)·Σ_{i=1}^{m} w_{t,i}
S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS'15
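The elastic-averaging update can be written out directly; the hyper-parameter values and the toy gradients below are placeholders, not taken from the paper:

    import numpy as np

    def easgd_step(w_workers, w_center, grads, eta=0.01, alpha=0.001, beta=0.9):
        # Each worker is pulled toward the center variable by the elastic force,
        # and the center moves toward the average of the workers (formulas above).
        new_workers = [w - eta * g - alpha * (w - w_center)
                       for w, g in zip(w_workers, grads)]
        new_center = (1.0 - beta) * w_center + beta * np.mean(w_workers, axis=0)
        return new_workers, new_center

    workers = [np.zeros(3) for _ in range(4)]           # m = 4 training agents
    center = np.zeros(3)
    grads = [np.ones(3) * (i + 1) for i in range(4)]    # placeholder gradients
    workers, center = easgd_step(workers, center, grads)
    print(center)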
23 Parameter consistency in deep learning (continued): at the inconsistent end of the spectrum, ensemble learning trains several models independently and averages their predictions (e.g., over the classes Cat, Dog, Airplane, Horse, Bicycle, Truck). Spectrum: synchronous SGD, stale-synchronous SGD, asynchronous SGD (HOGWILD!), model averaging (e.g., elastic), ensemble learning - from consistent to inconsistent. T.G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000
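For the ensemble end of the spectrum, inference simply averages the outputs of the independently trained models; the class probabilities below are made-up numbers for illustration:

    import numpy as np

    # Average the class-probability outputs of three independently trained models.
    predictions = [np.array([0.60, 0.20, 0.10, 0.05, 0.03, 0.02]),
                   np.array([0.50, 0.30, 0.10, 0.05, 0.03, 0.02]),
                   np.array([0.70, 0.10, 0.10, 0.05, 0.03, 0.02])]
    print(np.mean(predictions, axis=0))   # ensemble prediction over the six classes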
24 Communication optimizations. Different options for how to optimize the updates: send ∇w, receive w; send the FC factors (o_{l-1}, ∇o_l) and compute ∇w on the parameter server; broadcast the factors to avoid receiving the full w; use lossy compression when sending and accumulate the error locally! Quantization: quantize the weight updates and potentially the weights; the main trick is stochastic rounding [1] - the expectation is more accurate - which enables low precision (half, quarter) to become standard; TernGrad - ternary weights [2], 1-bit SGD [3]. Sparsification: do not send small weight updates, or only send the top-k [4], and accumulate the rest locally. (image source: ai.intel.com) [1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML'15 [2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016 [3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014 [4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
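Two of these techniques in miniature: stochastic rounding to a coarse grid, and top-k sparsification with local error accumulation. The grid step, vector size, and k are arbitrary choices for illustration:

    import numpy as np
    rng = np.random.default_rng(0)

    def stochastic_round(x, step=1/256):
        # Round to a coarse grid so that E[round(x)] = x (the "main trick" above).
        scaled = x / step
        low = np.floor(scaled)
        return step * (low + (rng.random(x.shape) < (scaled - low)))

    def topk_sparsify(grad, residual, k):
        # Send only the k largest-magnitude entries; accumulate the rest locally.
        acc = grad + residual
        idx = np.argpartition(np.abs(acc), -k)[-k:]
        sparse = np.zeros_like(acc)
        sparse[idx] = acc[idx]
        return sparse, acc - sparse          # (what is sent, new local residual)

    g = rng.standard_normal(8)
    residual = np.zeros(8)
    sent, residual = topk_sparsify(stochastic_round(g), residual, k=2)
    print(sent, residual)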
25 SparCML: quantized sparse allreduce for decentralized updates. [Figure: sparse update vectors ∇w_1, ∇w_2, ∇w_3 are combined into the dense update ∇w; chart of MNIST test accuracy.] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
26 Hyperparameter and architecture search: meta-optimization of hyper-parameters (e.g., momentum) and of the DNN architecture, using reinforcement learning [1] (explore/exploit different configurations), genetic algorithms with modified (specialized) mutations [2], particle swarm optimization [3], and other meta-heuristics such as evolutionary algorithms [4]. [1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017 [2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018 [3] P.R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17 [4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
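A toy sketch of such meta-optimization as a random search over learning rate and momentum; the "validation loss" below is a synthetic stand-in for actually training and evaluating a network, and the search ranges are arbitrary:

    import numpy as np
    rng = np.random.default_rng(0)

    def validation_loss(lr, momentum):
        # Stand-in for "train the network and measure validation loss";
        # a real search would launch many training runs in parallel.
        return (np.log10(lr) + 2.5) ** 2 + (momentum - 0.9) ** 2 + 0.01 * rng.random()

    trials = [(validation_loss(lr, mom), lr, mom)
              for lr, mom in zip(10 ** rng.uniform(-5, -1, 50),
                                 rng.uniform(0.0, 1.0, 50))]
    print(min(trials))   # best (loss, learning rate, momentum) found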
27 Application: Neural Code Comprehension. In 2017, GitHub reports 1 billion git commits in 337 languages! Can DNNs understand code? Previous approaches read the source code directly, which is suboptimal (loops, functions). Instead, code in C/C++, Python, CUDA, FORTRAN, Java, OpenCL, ... is lowered to LLVM IR, e.g.
    double thres = 5;
    if (x < thres) x = y * y;
    else x = 2 * y;
    x += 1;
becomes
    %cmp = fcmp olt double %x, 5
    br i1 %cmp, label %LT, label %GE
    LT: %2 = fmul double %y, %y
    GE: %3 = fmul double 2, %y
    AFTER: %4 = phi double [%2,%LT], [%3,%GE]
    %5 = fadd double %4, 1
[Figure: dataflow over basic blocks and the resulting conteXtual Flow Graph for this snippet.] Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
28 Application: Neural Code Comprehension - embedding space. Each LLVM IR statement (e.g., %cmp = fcmp olt double %x, 5 or %3 = fmul double 2, %y) is a node in the conteXtual Flow Graph and is embedded using the Skip-gram model, giving an embedding matrix of size (vocabulary size, in #statements) x (embedding dimensions). Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
29 Application: Neural Code Comprehension - downstream tasks. LSTM units on top of the statement embeddings predict, for example, which device is faster (CPU or GPU) and the optimal tiling; further uses include malicious code detection, guided programming, code optimization, and optimal hardware mapping. Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
30 Outlook. Full details are in the survey (60 pages): parallelism, distribution, synchronization, plus additional content on unsupervised (GAN/autoencoders) and recurrent (RNN/LSTM) models. A call to action to the HPC and ML/DL communities to join forces! It's already happening at the tool level - we need more joint events!
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, https://www.arxiv.org/abs/1802.09941
More information