T. HOEFLER: Demystifying Parallel and Distributed Deep Learning - An In-Depth Concurrency Analysis. Keynote at the 6th Accelerated Data Analytics and Computing Workshop (ADAC'18). https://arxiv.org/abs/1802.09941
What is Deep Learning good for? Digit Recognition, Object Classification, Segmentation, Image Captioning, Gameplay AI, Translation, Neural Computers. A very promising area of research: 23 papers per day! [Chart: number of deep learning papers per year, 1989-2017]
How does Deep Learning work? (network comparison: Canziani et al. 2017) Deep Learning is Supercomputing! Training data: ImageNet (1k): 180 GB, ImageNet (22k): a few TB, Industry: much larger. Models: 100-200 layers deep, ~1M-2B parameters, 1-8 GiB parameter storage, 1k-22k labels and growing (e.g., face recognition), weeks to train. [Figure: an input image is classified into probabilities over Cat, Dog, Airplane, Horse, Bicycle, Truck and compared with the true label to produce a layer-wise weight update.]
A brief theory of supervised deep learning. Labeled samples $x \in X \subset \mathcal{D}$, label domain $Y$, true label $l(x)$, network structure $f(x): X \to Y$ (fixed), weights $w$ (learned). Training solves
$w^* = \operatorname{argmin}_{w \in \mathbb{R}^d} \mathbb{E}_{x \sim \mathcal{D}}\,[\ell(w, x)]$.
The network is a composition of layer functions, $f(x) = f_n(f_{n-1}(\cdots f_2(f_1(x))\cdots))$, built from convolution, pooling, and fully connected layers. Typical loss functions:
$\ell_{sq}(w,x) = \lVert f(x) - l(x) \rVert^2$,
$\ell_{0/1}(w,x) = \begin{cases} 0 & f(x) = l(x) \\ 1 & f(x) \neq l(x) \end{cases}$,
$\ell_{ce}(w,x) = -\sum_i l(x)_i \cdot \log\!\left( e^{f(x)_i} / \sum_k e^{f(x)_k} \right)$.
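To make the cross-entropy loss concrete, here is a minimal NumPy sketch (the class scores are made-up numbers, not from the talk): the softmax turns the network outputs $f(x)$ into probabilities, which are compared against the one-hot true label $l(x)$.

import numpy as np

# Hypothetical network outputs f(x) for the six classes and a one-hot true label l(x).
f_x = np.array([1.2, 0.3, -0.8, -1.1, 0.1, -0.4])   # cat, dog, airplane, horse, bicycle, truck
l_x = np.array([1, 0, 0, 0, 0, 0], dtype=float)      # true label: cat

# Softmax turns scores into probabilities; cross-entropy compares them with the label.
probs = np.exp(f_x) / np.sum(np.exp(f_x))
loss_ce = -np.sum(l_x * np.log(probs))
print(probs.round(2), loss_ce)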
Stochastic Gradient Descent. SGD optimizes $w^* = \operatorname{argmin}_{w \in \mathbb{R}^d} \mathbb{E}_{x \sim \mathcal{D}}\,[\ell(w, x)]$ by propagating samples through the layer composition $f(x) = f_n(\cdots f_2(f_1(x))\cdots)$ (convolution, pooling, fully connected) and back. The storage per layer $\ell$ during training is roughly $|w_\ell| + |f_\ell(o_{\ell-1})| + |\nabla w_\ell| + |\nabla o_\ell|$ (weights, activations, and their gradients). T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
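A minimal sketch of minibatch SGD on a toy least-squares problem (the data, model, learning rate, and batch size are illustrative assumptions, not part of the talk):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))             # toy dataset
y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(1000)

w = np.zeros(8)                                # weights to learn
eta, batch = 0.05, 32                          # learning rate and minibatch size

for step in range(200):
    idx = rng.integers(0, len(X), batch)       # sample a minibatch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch    # gradient of the squared loss on the minibatch
    w -= eta * grad                            # SGD update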
Trends in deep learning: hardware and multi-node. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory.] Deep Learning is largely on distributed memory today! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
Trends in distributed deep learning: node count and communication. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: node counts over time; communication mode.] Deep Learning research is converging to MPI! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
Minibatch Stochastic Gradient Descent (SGD). [Figure: a minibatch of samples is processed together; the predicted class probabilities (Cat, Dog, Airplane, Horse, Bicycle, Truck) are compared with the true labels to form one averaged weight update.] T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
A primer of relevant parallelism and communication theory. Parallel reductions for parameter updates (small vs. large vectors): work $W$ and depth $D$ determine the average parallelism $W/D$ (in the example: $W = 39$, $D = 7$).
Tree: $T = 2L \log_2 P + 2\gamma m G \log_2 P$
Butterfly: $T = L \log_2 P + \gamma m G \log_2 P$
Pipeline: $T = 2L(P-1) + 2\gamma m G (P-1)/P$
Reduce-Scatter + Gather: $T = 2L \log_2 P + 2\gamma m G (P-1)/P$
Lower bound: $T \geq L \log_2 P + 2\gamma m G (P-1)/P$
E. Chan et al.: Collective communication: theory, practice, and experience, CCPE'07. T. Hoefler, D. Moor: Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations, JSFI'14
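Plugging illustrative numbers into these cost models shows the regimes; in this sketch the latency L and the combined per-byte term (written gamma_G) are assumed values, m is interpreted as the message size in bytes, and P is the number of processes - all assumptions for illustration, not measurements:

import numpy as np

L, gamma_G = 1e-6, 1e-9          # assumed latency [s] and time per byte [s/B]
P = 256                          # number of processes
for m in (1e3, 1e6, 1e9):        # message sizes from small to large gradients
    tree      = 2*L*np.log2(P) + 2*gamma_G*m*np.log2(P)
    butterfly =   L*np.log2(P) +   gamma_G*m*np.log2(P)
    pipeline  = 2*L*(P-1)      + 2*gamma_G*m*(P-1)/P
    redscat   = 2*L*np.log2(P) + 2*gamma_G*m*(P-1)/P
    print(f"m={m:.0e}  tree={tree:.2e}  butterfly={butterfly:.2e}  "
          f"pipeline={pipeline:.2e}  redscat+gat={redscat:.2e}")

For small vectors the latency terms dominate (butterfly wins); for large vectors the bandwidth-optimal reduce-scatter + gather wins.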
GoogLeNet in more detail: ~6.8M parameters, 22 layers deep. C. Szegedy et al.: Going Deeper with Convolutions, CVPR'15
Parallelism in the different layer types. Every layer is a function $f_\ell(x, w) \to o_\ell$; convolution, pooling, and fully connected layers all expose abundant parallelism: the work $W$ is linear in the layer size while the depth $D$ is only logarithmic, so the average parallelism $W/D$ is large. [Figure: a small 2D convolution example with input, filter, output feature map, and a max-pooled result.] T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
Computing fully connected layers. A fully connected layer $f_\ell(x, w) \to o_\ell$ computes $o_j = \sigma\!\left(\sum_i w_{i,j}\, x_i + b_j\right)$ for each output neuron $j$. Over a minibatch of $N$ samples this becomes a single matrix-matrix multiplication: stack the inputs as rows $(x_{n,1}, x_{n,2}, x_{n,3}, 1)$ of an $N \times 4$ matrix and multiply by the weight matrix whose columns are $(w_{1,j}, w_{2,j}, w_{3,j}, b_j)$, so the bias is absorbed into the extra column of ones.
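A small NumPy sketch of this matrix formulation (the batch size, layer sizes, and the sigmoid nonlinearity are illustrative assumptions):

import numpy as np

N, d_in, d_out = 4, 3, 2                       # minibatch size, inputs, outputs (as in the figure)
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d_in))
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

# Append a column of ones and a bias row so the whole layer is one matmul.
X1 = np.hstack([X, np.ones((N, 1))])           # N x (d_in + 1)
Wb = np.vstack([W, b])                         # (d_in + 1) x d_out
out = 1.0 / (1.0 + np.exp(-(X1 @ Wb)))         # sigma(X W + b) for the whole minibatch
assert np.allclose(out, 1.0 / (1.0 + np.exp(-(X @ W + b))))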
Computing convolutional layers. Direct: apply the filter to every input window with nested loops. Indirect: im2col lowers the convolution to a matrix-matrix multiplication by unfolding input patches into the columns of a matrix; FFT computes the convolution as a point-wise product in the frequency domain, $w * x = \mathcal{F}^{-1}(\mathcal{F}(w) \cdot \mathcal{F}(x))$; Winograd uses minimal filtering algorithms to reduce the number of multiplications.
S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014. X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR'17 Workshop. K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006. M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14. A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16
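A minimal sketch of the im2col lowering for a single-channel 2D convolution (no padding, stride 1; shapes and values are illustrative, and as in most frameworks the filter is applied without flipping, i.e., cross-correlation):

import numpy as np

def im2col_conv2d(x, w):
    """Convolve (cross-correlate) a 2D input x with filter w via im2col + matmul."""
    H, W_ = x.shape
    kH, kW = w.shape
    oH, oW = H - kH + 1, W_ - kW + 1
    # Unfold every kH x kW patch of the input into a column of the lowered matrix.
    cols = np.empty((kH * kW, oH * oW))
    for i in range(oH):
        for j in range(oW):
            cols[:, i * oW + j] = x[i:i+kH, j:j+kW].ravel()
    return (w.ravel() @ cols).reshape(oH, oW)   # one matrix product replaces the direct loops

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.array([[1., 0.], [-1., 2.]])
print(im2col_conv2d(x, w))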
Microbatching (µ-cuDNN). In cuDNN there are ~16 convolution implementations; performance depends on the temporary memory (workspace) size. Key idea: segment the minibatch into microbatches, reuse the workspace, and use different algorithms per microbatch. How to choose microbatch sizes and algorithms? Dynamic Programming (space reuse) or Integer Linear Programming (space sharing); utilizes heterogeneous clusters. Results on a P100-SXM2: fast (up to 4.54x on DeepBench) and memory efficient. Oyama et al.: µ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching, arXiv 2018
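A toy sketch of the dynamic-programming idea, not the actual µ-cuDNN implementation: with a hypothetical per-algorithm cost table in which faster algorithms need more workspace per sample, the DP splits the minibatch so that each microbatch can use the fastest algorithm that still fits the workspace limit.

import functools

# Hypothetical cost model (illustrative only): (time per sample, workspace MiB per sample).
ALGOS = {"implicit_gemm": (1.0, 0),
         "fft":           (0.6, 8),
         "winograd":      (0.4, 32)}

def best_algo_time(micro, ws_limit):
    """Fastest algorithm whose workspace for a microbatch of size `micro` fits the limit."""
    feasible = [t * micro for t, ws in ALGOS.values() if ws * micro <= ws_limit]
    return min(feasible)

@functools.lru_cache(maxsize=None)
def min_time(batch, ws_limit):
    """Minimum time to process `batch` samples by splitting off a first microbatch."""
    if batch == 0:
        return 0.0
    return min(best_algo_time(micro, ws_limit) + min_time(batch - micro, ws_limit)
               for micro in range(1, batch + 1))

print(min_time(256, 1024))   # splitting into small microbatches enables the fastest algorithm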
Model parallelism. Parameters can be distributed across processors. The mini-batch has to be copied to all processors. Backpropagation requires all-to-all communication in every layer. U. A. Muller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int'l Conf. on Neural Networks 1994
Pipeline parallelism. Layers/parameters can be distributed across processors. Sparse communication pattern (only between pipeline stages). The mini-batch has to be copied through all processors. G. Blelloch and C. R. Rosenberg: Network Learning on the Connection Machine, IJCAI'87
Data parallelism. A simple and efficient solution, easy to implement. Parameters are duplicated at all processors. X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS'89
Hybrid parallelism. Layers/parameters can be distributed across processors, and the minibatch can be distributed as well. Often specific to layer types (e.g., distribute fully connected layers but handle convolutional layers data-parallel). Enables arbitrary combinations of data, model, and layer (pipeline) parallelism: very powerful! A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014. J. Dean et al.: Large scale distributed deep networks, NIPS'12. T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
Updating parameters in distributed data parallelism. Central: a (sharded) parameter server gathers the gradients and applies the update $w = u(w, \nabla w)$; with $s$ shards the cost per step is $T = 2L + 2P \gamma m G / s$. Decentral: a collective allreduce of $\nabla w$ among the training agents, with cost $T = 2L \log_2 P + 2\gamma m G (P-1)/P$; design questions include collective operations, topologies, neighborhood collectives, RMA? Extensions: Hierarchical Parameter Server (S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16) and Adaptive Minibatch Size (S. L. Smith et al.: Don't Decay the Learning Rate, Increase the Batch Size, arXiv 2017).
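A hedged sketch of the decentral variant using mpi4py (the gradient here is a random placeholder; in a real framework it comes from backpropagation on the local minibatch):

# Run with, e.g.: mpirun -np 4 python data_parallel_sgd.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()

w = np.zeros(1000)                       # replicated parameters
eta = 0.01

for step in range(100):
    local_grad = np.random.randn(1000)   # placeholder for the gradient of the local minibatch
    global_grad = np.empty_like(local_grad)
    # Decentral update: allreduce (sum) of the local gradients, then average and apply.
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    w -= eta * global_grad / P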
Parameter (and Model) consistency - centralized. The parameter exchange frequency can be controlled while still attaining convergence: synchronous, stale-synchronous / bounded asynchronous (each agent may lag the others by at most a maximum staleness), or fully asynchronous exchange with the (sharded) parameter server applying $w = u(w, \nabla w)$. [Figure: per-agent timelines of the parameter versions $w_{t,i}$ exchanged with the parameter server under each consistency model.] Started with Hogwild! [Niu et al. 2011] on shared memory; DistBelief [Dean et al. 2012] moved the idea to distributed memory. Trades off statistical performance for hardware performance. J. Dean et al.: Large scale distributed deep networks, NIPS'12. F. Niu et al.: Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11
Parameter (and Model) consistency - decentralized. The parameter exchange frequency can be controlled while still attaining convergence: the training agents merge their models through a collective allreduce of $\nabla w$, either synchronously, with bounded staleness (a maximum staleness between agents), or asynchronously. [Figure: per-agent timelines of allreduce rounds under the synchronous, stale-synchronous / bounded asynchronous, and asynchronous models.] May also consider limited/slower distribution: gossip [Jin et al. 2016]. Peter H. Jin et al.: How to scale distributed deep learning?, NIPS MLSystems 2016
Parameter consistency in deep learning. On the spectrum from consistent to inconsistent (Synchronous SGD, Stale-Synchronous SGD, Asynchronous SGD (HOGWILD!), Model Averaging, Ensemble Learning), elastic averaging uses physical forces between the different versions of $w$ and a center variable $\bar{w}$ kept by the parameter server:
$w_{t+1,i} = w_{t,i} - \eta \nabla w_{t,i} - \alpha\,(w_{t,i} - \bar{w}_t)$, $\quad \bar{w}_{t+1} = (1-\beta)\,\bar{w}_t + \frac{\beta}{m}\sum_{i=1}^{m} w_{t,i}$.
S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS'15
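A single-process toy simulation of the elastic-averaging update above (the quadratic per-agent losses and the values of eta, alpha, beta are arbitrary assumptions for illustration):

import numpy as np

m, d = 4, 10                                  # agents and parameter dimension
rng = np.random.default_rng(0)
targets = rng.standard_normal((m, d))         # each agent's toy optimum
w = np.zeros((m, d))                          # local parameters w_{t,i}
w_bar = np.zeros(d)                           # center variable
eta, alpha, beta = 0.1, 0.05, 0.5

for t in range(500):
    grads = 2 * (w - targets)                 # gradient of each agent's toy loss
    w = w - eta * grads - alpha * (w - w_bar) # local update with elastic force toward the center
    w_bar = (1 - beta) * w_bar + beta * w.mean(axis=0)   # center moves toward the agents' average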
Parameter consistency in deep learning. At the inconsistent end of the spectrum (Synchronous SGD, Stale-Synchronous SGD, Asynchronous SGD (HOGWILD!), Model Averaging (e.g., elastic), Ensemble Learning) lies ensemble learning: independently trained models are combined by averaging their predicted class probabilities. [Figure: averaging the Cat/Dog/Airplane/Horse/Bicycle/Truck probabilities of several ensemble members.] T. G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000
Communication optimizations. Different options for optimizing the updates: send $\nabla w$, receive $w$; send the fully connected factors $(o_{\ell-1}, \nabla o_\ell)$ and compute $\nabla w$ on the parameter server; broadcast the factors to avoid receiving the full $w$; use lossy compression when sending and accumulate the error locally!
Quantization: quantize the weight updates and potentially the weights. The main trick is stochastic rounding [1] (the expectation is more accurate). Enables low precision (half, quarter) to become standard. TernGrad - ternary weights [2], 1-bit SGD [3].
Sparsification: do not send small weight updates, or only send the top-k [4]; accumulate the rest locally. (source: ai.intel.com)
[1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML'15. [2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016. [3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014. [4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
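A sketch of top-k sparsification with local error accumulation (the vector size and k are illustrative; the schemes in [4] additionally encode the surviving indices and values compactly before communicating them):

import numpy as np

def sparsify_topk(grad, residual, k):
    """Keep only the k largest-magnitude entries; accumulate everything else locally."""
    acc = grad + residual                     # add the error carried over from previous steps
    idx = np.argpartition(np.abs(acc), -k)[-k:]
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                    # this sparse vector is what gets communicated
    new_residual = acc - sparse               # everything not sent is remembered locally
    return sparse, new_residual

grad = np.random.randn(1_000_000)
residual = np.zeros_like(grad)
sparse_update, residual = sparsify_topk(grad, residual, k=1000)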
SparCML: quantized sparse allreduce for decentral updates. Sparse gradient contributions $\nabla w_1, \ldots, \nabla w_4$ are summed in a recursive exchange of sparse vectors. [Figure: the sparse allreduce combining the four contributions; plot of MNIST test accuracy.] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
Hyperparameter and Architecture search. Meta-optimization of hyper-parameters (e.g., momentum) and of the DNN architecture, using Reinforcement Learning [1] (explore/exploit different configurations), Genetic Algorithms with modified (specialized) mutations [2], Particle Swarm Optimization [3] and other meta-heuristics, and Evolutionary Algorithms [4].
[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017. [2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018. [3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17. [4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
Application: Neural Code Comprehension. In 2017, GitHub reports 1 billion git commits in 337 languages! Can DNNs understand code? Previous approaches read the source code directly, which is suboptimal (loops, functions). Instead, compile C/C++, Python, CUDA, FORTRAN, Java, OpenCL, ... to LLVM IR and build a conteXtual Flow Graph (XFG) from the dataflow between basic blocks. Example source and IR:

double thres = 5;
if (x < thres)
  x = y * y;
else
  x = 2 * y;
x += 1;

  %cmp = fcmp olt double %x, 5.0
  br i1 %cmp, label %LT, label %GE
LT:
  %2 = fmul double %y, %y
GE:
  %3 = fmul double 2.0, %y
AFTER:
  %4 = phi double [%2, %LT], [%3, %GE]
  %5 = fadd double %4, 1.0

[Figure: the dataflow between the basic blocks and the resulting conteXtual Flow Graph over %x, %y, %cmp, %2, %3, %4, %5.]
Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
Application: Neural Code Comprehension. Statements are embedded into a continuous space using the Skip-gram model over XFG contexts: every statement in the vocabulary (#stmts) is mapped to a fixed-length embedding vector. [Figure: the embedding matrix of size vocabulary size (#stmts) x embedding dimensions, with example LLVM IR statements such as %cmp = fcmp olt double %x, 5.0 and %3 = fmul double 2.0, %y.] Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
Application: Neural Code Comprehension. The learned statement embeddings feed recurrent networks (LSTM units) for downstream tasks: predicting which device is faster (CPU or GPU), optimal tiling, malicious code detection, guided programming, code optimization, and optimal hardware mapping. Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
Outlook. Full details in the survey (60 pages): parallelism, distribution, synchronization. https://arxiv.org/abs/1802.09941 Additional content: unsupervised learning (GANs/autoencoders), recurrent networks (RNN/LSTM). Call to action to the HPC and ML/DL communities to join forces! It's already happening at the tool level; we need more joint events!