T. HOEFLER: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. Keynote at the 6th Accelerated Data Analytics and Computing Workshop (ADAC'18). https://www.arxiv.org/abs/1802.09941

What is Deep Learning good for? Digit recognition, object classification, segmentation, image captioning, gameplay AI, translation, neural computers. A very promising area of research: 23 papers per day! [Chart: number of papers per year, 1989-2017.]

How does Deep Learning work? Deep Learning is f(x), and f(x) is supercomputing! [Figure: a deep network maps an input image to class scores (Cat, Dog, Airplane, Horse, Bicycle, Truck) that are compared against the true label to drive layer-wise weight updates; network comparison after Canziani et al. 2017.] Datasets: ImageNet (1k) is about 180 GB, ImageNet (22k) a few TB, industry datasets much larger. Models: 100-200 layers deep, ~100M-2B parameters, 0.1-8 GiB of parameter storage, 10-22k labels and growing (e.g., face recognition), weeks to train.

A brief theory of supervised deep learning. Given labeled samples $x \in X \subset \mathcal{D}$, a label domain $Y$, and true labels $l(x)$, the network $f(x): X \to Y$ has a fixed structure and learned weights $w$, chosen as
$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(w, x)\right].$$
The network is a composition of layers (convolutions, pooling, fully connected): $f(x) = f_n(f_{n-1}(f_{n-2}(\cdots f_1(x))))$. [Figure: predicted class scores (Cat, Dog, Airplane, Horse, Bicycle, Truck) compared against the true label, driving layer-wise weight updates.] Typical loss functions:
- squared loss: $\ell_{sq}(w, x) = \lVert f(x) - l(x) \rVert^2$
- 0-1 loss: $\ell_{0/1}(w, x) = \begin{cases} 0 & f(x) = l(x) \\ 1 & f(x) \neq l(x) \end{cases}$
- cross-entropy loss: $\ell_{ce}(w, x) = -\sum_i l(x)_i \cdot \log \dfrac{e^{f(x)_i}}{\sum_k e^{f(x)_k}}$
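
To make the loss definitions concrete, here is a minimal NumPy sketch (not from the talk; the logits and label values are made up) that evaluates the cross-entropy loss exactly as defined above.

```python
import numpy as np

def cross_entropy_loss(logits, one_hot_label):
    """l_ce(w, x) = -sum_i l(x)_i * log(softmax(f(x))_i)."""
    shifted = logits - np.max(logits)          # subtract max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.sum(one_hot_label * np.log(probs))

# Network output f(x) over 6 classes and the true one-hot label l(x) = "Cat".
logits = np.array([2.1, 0.3, -1.0, -0.5, 1.2, -0.8])
label  = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(cross_entropy_loss(logits, label))
```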

Stochastic Gradient Descent: $w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(w, x)\right]$. [Figure: forward pass through convolution 1, convolution 2, pooling, convolution 3, and fully connected layers, producing $f_1(x)$, $f_2(f_1(x))$, ..., $f(x)$.] Per-layer storage $= |w_l| + |f_l(o_{l-1})| + |\nabla w_l| + |\nabla o_l|$. T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018.

Trends in deep learning: hardware and multi-node. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory.] Deep Learning is largely on distributed memory today! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018.

Trends in distributed deep learning: node count and communication. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: node counts over time; communication mode.] Deep Learning research is converging to MPI! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018.

Minibatch Stochastic Gradient Descent (SGD). [Figure: predicted class probabilities (Cat, Dog, Airplane, Horse, Bicycle, Truck) for a minibatch sample compared against the one-hot true label (Cat).] T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018.
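
The minibatch SGD loop itself is short; the following NumPy sketch is purely illustrative (a toy linear model on synthetic data, with assumed hyperparameters) and shows the gradient being averaged over each minibatch before the weight update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 10))          # synthetic samples
y = X @ rng.standard_normal(10) + 0.1        # synthetic regression targets

w = np.zeros(10)                             # model weights
eta, batch_size = 0.05, 64

for epoch in range(10):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(idx)   # minibatch-averaged gradient
        w -= eta * grad                              # SGD update
```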

A primer of relevant parallelism and communication theory: parallel reductions for parameter updates. [Figure: an example computation DAG with work $W = 39$ and depth $D = 7$; average parallelism $= W/D$.] Reduction schemes for small vs. large vectors (latency $L$, cost per byte $G$, reduction cost factor $\gamma$, message size $m$, $P$ processes):
- Tree: $T = 2L \log_2 P + 2\gamma m G \log_2 P$
- Butterfly: $T = L \log_2 P + \gamma m G \log_2 P$
- Pipeline: $T = 2L(P-1) + 2\gamma m G (P-1)/P$
- Reduce-Scatter + Gather: $T = 2L \log_2 P + 2\gamma m G (P-1)/P$
- Lower bound: $T \geq L \log_2 P + 2\gamma m G (P-1)/P$
E. Chan et al.: Collective communication: theory, practice, and experience, CCPE'07. T. Hoefler, D. Moor: Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations, JSFI'14.
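
As a worked example of these cost models, the sketch below plugs made-up machine parameters into the formulas above and compares the predicted allreduce times (the values of L, G, gamma, m, and P are assumptions, not numbers from the talk).

```python
import math

# Symbols follow the slide's cost model: L = latency, G = cost per byte,
# gamma = reduction cost factor, m = message size, P = processes.
L, G, gamma, m, P = 1e-6, 5e-10, 1.0, 4e8, 64

def T_tree(P):        return 2*L*math.log2(P) + 2*gamma*m*G*math.log2(P)
def T_butterfly(P):   return   L*math.log2(P) +   gamma*m*G*math.log2(P)
def T_pipeline(P):    return 2*L*(P-1)        + 2*gamma*m*G*(P-1)/P
def T_redscat_gat(P): return 2*L*math.log2(P) + 2*gamma*m*G*(P-1)/P
def T_lower_bound(P): return   L*math.log2(P) + 2*gamma*m*G*(P-1)/P

for name, f in [("tree", T_tree), ("butterfly", T_butterfly),
                ("pipeline", T_pipeline), ("redscat+gat", T_redscat_gat),
                ("lower bound", T_lower_bound)]:
    print(f"{name:12s} T = {f(P):.4f} s")
```

For large vectors the bandwidth term dominates, which is why reduce-scatter+gather and pipelined reductions beat the tree in this model.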

GoogLeNet in more detail: ~6.8M parameters, 22 layers deep. C. Szegedy et al.: Going Deeper with Convolutions, CVPR'15.

Parallelism in the different layer types. [Figure: the operator $f_l(x, w) \to o_l$ illustrated on a small numeric example for convolutional, pooling, and fully connected layers.] Work $W$ is linear and depth $D$ is logarithmic, hence the average parallelism $W/D$ is large. T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018.

Computing fully connected layers: $f_l(x, w) \to o_l$. Each output neuron computes $\sigma\!\left(\sum_i w_{i,1} x_i + b_1\right)$ and $\sigma\!\left(\sum_i w_{i,2} x_i + b_2\right)$. Over a minibatch of $N$ samples the whole layer becomes one matrix-matrix multiplication: append a column of ones to the inputs and a row of biases to the weights,
$$\begin{pmatrix} x_{1,1} & x_{1,2} & x_{1,3} & 1 \\ x_{2,1} & x_{2,2} & x_{2,3} & 1 \\ \vdots & & & \vdots \\ x_{N,1} & x_{N,2} & x_{N,3} & 1 \end{pmatrix} \cdot \begin{pmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \\ w_{3,1} & w_{3,2} \\ b_1 & b_2 \end{pmatrix}$$
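
A minimal NumPy sketch of this bias-folding trick (shapes follow the slide's N x 3 input and two output neurons; the values are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N = 4
X = np.random.rand(N, 3)                 # minibatch of N samples, 3 features
W = np.random.rand(3, 2)                 # weights w_{i,j} for 2 output neurons
b = np.random.rand(2)                    # biases b_1, b_2

# Fold the bias into the matrix product: append a column of ones to X
# and a row of biases to W, then the whole layer is one GEMM + sigmoid.
X1 = np.hstack([X, np.ones((N, 1))])     # N x 4
Wb = np.vstack([W, b])                   # 4 x 2
out = sigmoid(X1 @ Wb)                   # N x 2 layer output

assert np.allclose(out, sigmoid(X @ W + b))
```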

Computing convolutional layers: direct and indirect methods. Direct: slide the kernel over the input (the figure shows a small numeric example, input * kernel = output). Indirect: FFT ($\mathcal{F}^{-1}(\mathcal{F}(w) \odot \mathcal{F}(x))$), Winograd, and im2col (lowering the convolution to a matrix multiplication). S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014. X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR'17 Workshop. K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition, 2006. M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14. A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16.
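
To illustrate the im2col lowering mentioned above, here is a small NumPy sketch (single channel, stride 1, no padding, toy values) that turns the convolution into one matrix multiplication and checks it against the direct sliding-window computation:

```python
import numpy as np

def im2col(x, kh, kw):
    """Gather all kh x kw patches of x into the columns of a matrix."""
    H, W = x.shape
    cols = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            cols.append(x[i:i + kh, j:j + kw].ravel())
    return np.array(cols).T              # (kh*kw) x (#patches)

x = np.arange(16.0).reshape(4, 4)        # toy 4x4 input
k = np.array([[1.0, 0.0], [0.0, -1.0]])  # toy 2x2 kernel

cols = im2col(x, *k.shape)               # 4 x 9 patch matrix
out = (k.ravel() @ cols).reshape(3, 3)   # convolution as a single GEMM

# Cross-check against direct (sliding-window) convolution.
direct = np.array([[np.sum(x[i:i+2, j:j+2] * k) for j in range(3)]
                   for i in range(3)])
assert np.allclose(out, direct)
```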

Microbatching (µ-cuDNN): fast (up to 4.54x on DeepBench). In cuDNN there are ~16 convolution implementations, and performance depends on the temporary memory (workspace) size. Key idea: segment the minibatch into microbatches, reuse the workspace, and use different algorithms per microbatch. How to choose microbatch sizes and algorithms? Via memory-efficient Dynamic Programming (workspace reuse) or Integer Linear Programming (workspace sharing); the scheme also utilizes heterogeneous clusters (e.g., P100-SXM2 GPUs). Oyama et al.: µ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching, arXiv 2018.

Model parallelism: the parameters are distributed across processors. The mini-batch has to be copied to all processors, and backpropagation requires all-to-all communication in every layer. U. A. Muller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int'l Conf. on Neural Networks, 1994.

Pipeline parallelism: layers/parameters are distributed across processors. The communication pattern is sparse (only between pipeline stages), but the mini-batch has to be copied through all processors. G. Blelloch and C. R. Rosenberg: Network Learning on the Connection Machine, IJCAI'87.

Data parallelism: a simple and efficient solution that is easy to implement; the parameters are duplicated at all processors. X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS'89.
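
A minimal sketch of data-parallel training with an allreduce over gradients, using mpi4py (the toy model, synthetic data, and hyperparameters are assumptions; a real setup would shard a dataset and use a DNN framework):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)        # each rank sees a different data shard
w = np.zeros(10)                         # replicated model parameters
eta, local_batch = 0.05, 32

for step in range(100):
    # Local minibatch and local gradient (toy linear regression as a stand-in).
    X = rng.standard_normal((local_batch, 10))
    y = X @ np.ones(10)
    g_local = 2 * X.T @ (X @ w - y) / local_batch

    # Average gradients across all P replicas with an allreduce.
    g_global = np.empty_like(g_local)
    comm.Allreduce(g_local, g_global, op=MPI.SUM)
    g_global /= P

    w -= eta * g_global                  # identical update on every rank
```

Launched with, e.g., mpirun -np 4 python train.py (hypothetical file name), every rank computes a local gradient, the allreduce averages it, and all replicas stay in sync.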

Hybrid parallelism combines data parallelism, model parallelism, and layer (pipeline) parallelism: layers/parameters can be distributed across processors and the minibatch can be distributed as well, often in a layer-type-specific way (e.g., distribute fully connected layers but handle convolutional layers data-parallel). This enables arbitrary combinations of data, model, and pipeline parallelism, which is very powerful! A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014. J. Dean et al.: Large scale distributed deep networks, NIPS'12. T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018.

Updating parameters in distributed data parallelism. Decentral: a collective allreduce of $\nabla w$, with $T = 2L \log_2 P + 2\gamma m G (P-1)/P$; relevant tools are collective operations, topologies, neighborhood collectives, and possibly RMA. Central: training agents push $\nabla w$ to a (sharded) parameter server that applies $w = u(w, \nabla w)$, with $T = 2L + 2P\gamma m G / s$ for $s$ shards. Refinements: a Hierarchical Parameter Server (S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16) and an Adaptive Minibatch Size (S. L. Smith et al.: Don't Decay the Learning Rate, Increase the Batch Size, arXiv 2017).
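
The central (parameter-server) variant can be sketched as follows; this single-process NumPy simulation is only illustrative (the update rule u, the shard count, and the agent loop are assumptions, not the systems cited above):

```python
import numpy as np

class ShardedParameterServer:
    """w = u(w, grad): each shard applies the update rule to its slice of w."""
    def __init__(self, dim, num_shards, eta=0.05):
        self.shards = np.array_split(np.zeros(dim), num_shards)
        self.eta = eta

    def push(self, shard_id, grad_shard):         # agent sends its gradient slice
        self.shards[shard_id] -= self.eta * grad_shard

    def pull(self, shard_id):                     # agent fetches fresh parameters
        return self.shards[shard_id].copy()

ps = ShardedParameterServer(dim=8, num_shards=2)

# One simulated round: 4 training agents push gradient slices to both shards.
for agent in range(4):
    for s in range(2):
        grad = np.random.standard_normal(4)       # placeholder gradient slice
        ps.push(s, grad)

w = np.concatenate([ps.pull(s) for s in range(2)])
```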

Parameter (and model) consistency, centralized. The parameter exchange frequency can be controlled while still attaining convergence. [Figure: agent timelines pushing updates $\nabla w_{t,i}$ to a sharded parameter server ($w = u(w, \nabla w)$) under three regimes: synchronous, stale-synchronous / bounded asynchronous (with a maximum staleness), and asynchronous.] Asynchrony started with Hogwild! [Niu et al. 2011] on shared memory, somewhat by chance; DistBelief [Dean et al. 2012] moved the idea to distributed memory. It trades off statistical performance for hardware performance. J. Dean et al.: Large scale distributed deep networks, NIPS'12. F. Niu et al.: Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11.

Parameter (and model) consistency, decentralized. The parameter exchange frequency can again be controlled while still attaining convergence, now via a collective allreduce of $\nabla w$. [Figure: agent timelines with allreduce-based exchange under synchronous, stale-synchronous / bounded asynchronous (with a maximum staleness), and asynchronous (merge-based) regimes.] One may also consider limited/slower distribution, i.e., gossip [Jin et al. 2016]. Peter H. Jin et al.: How to scale distributed deep learning?, NIPS MLSystems 2016.

Parameter consistency in deep learning: elastic averaging. [Figure: agents exchange with a parameter server that maintains an elastic average $\bar{w}$.] Elastic Averaging SGD uses physical (spring-like) forces between the different versions of $w$:
$$w_{t+1,i} = w_{t,i} - \eta \nabla w_{t,i} - \alpha\,(w_{t,i} - \bar{w}_t), \qquad \bar{w}_{t+1} = (1 - \beta)\,\bar{w}_t + \frac{\beta}{m} \sum_{i=1}^{m} w_{t,i}.$$
The consistency spectrum ranges from consistent to inconsistent: Synchronous SGD, Stale-Synchronous SGD, Asynchronous SGD (HOGWILD!), Model Averaging (e.g., elastic), Ensemble Learning. S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS'15.
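
The two elastic-averaging rules can be simulated directly; the NumPy sketch below (one process standing in for m agents, a toy quadratic objective, and made-up hyperparameters) applies exactly the formulas above.

```python
import numpy as np

m, d = 4, 10                               # number of agents, parameter dimension
eta, alpha, beta = 0.05, 0.01, 0.5

rng = np.random.default_rng(0)
w_agents = rng.standard_normal((m, d))     # per-agent parameters w_{t,i}
w_bar = w_agents.mean(axis=0)              # center variable \bar{w}_t

def grad(w):                               # placeholder gradient of a toy objective
    return 2 * (w - 1.0)

for t in range(100):
    w_prev = w_agents
    # Agent update: w_{t+1,i} = w_{t,i} - eta*grad(w_{t,i}) - alpha*(w_{t,i} - w_bar_t)
    w_agents = w_prev - eta * grad(w_prev) - alpha * (w_prev - w_bar)
    # Center update: w_bar_{t+1} = (1 - beta)*w_bar_t + (beta/m) * sum_i w_{t,i}
    w_bar = (1 - beta) * w_bar + (beta / m) * w_prev.sum(axis=0)
```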

Parameter consistency in deep learning: ensemble learning sits at the inconsistent end of the spectrum (Synchronous SGD, Stale-Synchronous SGD, Asynchronous SGD (HOGWILD!), Model Averaging (e.g., elastic), Ensemble Learning). [Figure: the class probabilities (Cat, Dog, Airplane, Horse, Bicycle, Truck) predicted by several independently trained models are averaged into one prediction.] T. G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000.

Communication optimizations. There are different options for optimizing the updates: send $\nabla w$, receive $w$; send the fully connected layer factors ($o_{l-1}$, $\nabla o_l$) and compute $\nabla w$ on the parameter server; broadcast the factors so as not to receive the full $w$; use lossy compression when sending and accumulate the error locally! Quantization: quantize the weight updates and potentially the weights; the main trick is stochastic rounding [1], which is accurate in expectation and enables low precision (half, quarter) to become standard; see also TernGrad / ternary weights [2] and 1-bit SGD [3]. Sparsification: do not send small weight updates, or send only the top-k [4], and accumulate the remainder locally. (Source: ai.intel.com.) [1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML'15. [2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016. [3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014. [4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018.
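
Two of the compression tricks above, stochastic rounding and top-k sparsification with local error accumulation, can be sketched as follows (NumPy; the grid spacing, k, and gradient values are arbitrary illustrations, not any system's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, step=2.0**-8):
    """Round to a grid of spacing `step`; unbiased, i.e. correct in expectation."""
    scaled = x / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                     # P(round up) = fractional part
    return (lower + (rng.random(x.shape) < prob_up)) * step

def topk_sparsify(grad, residual, k):
    """Send only the k largest-magnitude entries; keep the rest locally."""
    acc = grad + residual                        # add locally accumulated error
    idx = np.argpartition(np.abs(acc), -k)[-k:]  # indices of the top-k entries
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]
    new_residual = acc - sparse                  # error fed back into next step
    return sparse, new_residual

grad = rng.standard_normal(1000)
residual = np.zeros_like(grad)
sparse_update, residual = topk_sparsify(stochastic_round(grad), residual, k=10)
```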

SparCML: quantized sparse allreduce for decentral updates. [Figure: sparse contributions $\nabla w_1 \ldots \nabla w_4$ are combined (+) in a recursive sparse reduction; plot of MNIST test accuracy.] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018.

Hyperparameter and architecture search: meta-optimization of hyperparameters (e.g., momentum) and of the DNN architecture, using reinforcement learning [1] (explore/exploit different configurations), genetic algorithms with modified (specialized) mutations [2], particle swarm optimization [3], and other meta-heuristics. [Figures: reinforcement learning [1]; evolutionary algorithms [4].] [1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017. [2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018. [3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17. [4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18.

Application: Neural Code Comprehension. In 2017, GitHub reports 1 billion git commits in 337 languages! Can DNNs understand code? Previous approaches read the code directly, which is suboptimal (loops, functions). Instead, code from C/C++, Python, CUDA, FORTRAN, Java, OpenCL, etc. is lowered to LLVM IR; for example, "double thres = 5.0; if (x < thres) x = y * y; else x = 2 * y; x += 1;" becomes the IR statements "%cmp = fcmp olt double %x, 5.0", "br i1 %cmp, label %LT, label %GE", "LT: %2 = fmul double %y, %y", "GE: %3 = fmul double 2.0, %y", "AFTER: %4 = phi double [%2,%LT], [%3,%GE]", "%5 = fadd double %4, 1.0", from which the dataflow (basic blocks) and a contextual flow graph are constructed. Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018.

Application: Neural Code Comprehension, continued. The IR statements are mapped into an embedding space using the Skip-gram model, trained on contexts taken from the contextual flow graph. [Figure: embedding matrix of size vocabulary size (#stmts) by embedding dimensions.] Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018.
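
As an illustration of this embedding step (this is not the authors' pipeline), one could train a Skip-gram model over sequences of IR statements with gensim; the tokenization, the linear sequences standing in for graph contexts, and all parameters below are assumptions.

```python
# Assumes gensim >= 4.0; contexts would really come from the contextual
# flow graph, here approximated by linear statement sequences.
from gensim.models import Word2Vec

sequences = [
    ["%cmp = fcmp olt double %x, 5.0",
     "br i1 %cmp, label %LT, label %GE",
     "%2 = fmul double %y, %y",
     "%4 = phi double [%2,%LT], [%3,%GE]",
     "%5 = fadd double %4, 1.0"],
    ["%3 = fmul double 2.0, %y",
     "%4 = phi double [%2,%LT], [%3,%GE]",
     "%5 = fadd double %4, 1.0"],
]

model = Word2Vec(sentences=sequences, vector_size=64, window=2,
                 min_count=1, sg=1, epochs=50, seed=0)   # sg=1: Skip-gram

vec = model.wv["%5 = fadd double %4, 1.0"]               # one statement embedding
similar = model.wv.most_similar("%4 = phi double [%2,%LT], [%3,%GE]", topn=2)
```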

Application: Neural Code Comprehension, continued. On top of the embedding space, recurrent networks (LSTM units) solve downstream tasks: predicting which device is faster (CPU or GPU), i.e., optimal hardware mapping, optimal tiling, malicious code detection, guided programming, and code optimization. Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018.

Outlook. Full details are in the survey (60 pages), https://www.arxiv.org/abs/1802.09941: parallelism, distribution, and synchronization, plus additional content on unsupervised learning (GANs/autoencoders) and recurrent networks (RNN/LSTM). Call to action to the HPC and ML/DL communities to join forces! It's already happening at the tool level; we need more joint events!