Outline. Overview CNTK introduction. Educational resources Conclusions. Symbolic loop Batch scheduling Data parallel training

Size: px

Start display at page:

Download "Outline. Overview CNTK introduction. Educational resources Conclusions. Symbolic loop Batch scheduling Data parallel training"

Muriel McBride
6 years ago
Views:

3 Outline Overview CNTK introduction Symbolic loop Batch scheduling Data parallel training Educational resources Conclusions

4 Outline Overview CNTK introduction Symbolic loop Batch scheduling Data parallel training Educational resources Conclusions

5 Deep Learning at Services Skype Translator Cortana Bing HoloLens Research

6 24% 14%

7 ImageNet: 2015 ResNet 28.2 ImageNet Classification top-5 error (%) ILSVRC 2010 NEC America ILSVRC 2011 Xerox ILSVRC 2012 AlexNet ILSVRC 2013 Clarifi ILSVRC 2014 VGG ILSVRC 2014 GoogleNet ILSVRC 2015 ResNet had all 5 entries being the 1-st places in 2015: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation

8 Services

11 Image Similarity Goal: given query image, find similar images. Customer: Anonymous ISV (Azure Partner) Task: given a retail image, find same product on competitor websites (to compare price). Existing solution: solely based on mining text information from the websites of Target, Macy, etc. Customer asked for individual similarity measure (e.g. texture, neck style, etc).

12 Bing / Bing Ads

13 Translator Power point-plug in for translating speech to subtitles

14 Youtube Link

15 Customer Support Agent

(CNTK) s open-source deep-learning toolkit https://github.com//cntk Created by Speech researchers (Dong Yu et al.

21 (CNTK) s open-source deep-learning toolkit Created by Speech researchers (Dong Yu et al.) in 2012, Computational Network On GitHub since Jan 2016 under MIT license Python support since Oct 2016 (beta), rebranded as External contributions e.g. from MIT, Stanford and Nvidia Over 80% internal DL workload runs CNTK 1st-class on Linux and Windows, docker support Python, C++, C#, Java, Keras Internal == External

22 CTNK Speed Benchmarking by HKBU, Version 8 Single Tesla K80 GPU, CUDA: 8.0 CUDNN: v5.1 Caffe: 1.0rc5(39f28e4) CNTK: 2.0 Beta10(1ae666d) MXNet: 0.93(32dc3a2) TensorFlow: 1.0(4ac9c09) Torch: 7(748f5e3) Caffe CNTK MxNet TensorFlow Torch FCN5 (1024) ms ms ms ms ms AlexNet (256) ms ms ms ms ms ResNet (32) ms ms ms ms ms LSTM (256) (v7 benchmark) ms (44.917ms) ms ( ms) - ( ms) ms ( ms)

23 CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-gpu/multi-server speed comparison (samples/second), higher = better [note: December 2015] Achieved with 1-bit gradient quantization algorithm Theano only supports 1 GPU CNTK Theano TensorFlow Torch 7 Caffe 1 GPU 1 x 4 GPUs 2 x 4 GPUs (8 GPUs)

images / sec MICROSOFT COGNITIVE TOOLKIT First Deep Learning Framework Fully Optimized for Pascal Delivering Near-Linear Multi-GPU Scaling AlexNet Performance 14,000 12,000 10,000 13,000 170x Faster

24 images / sec MICROSOFT COGNITIVE TOOLKIT First Deep Learning Framework Fully Optimized for Pascal Delivering Near-Linear Multi-GPU Scaling AlexNet Performance 14,000 12,000 10,000 13, x Faster v. CPU Server 8,000 7,600 6,000 4,000 2,000 2,400 3, Dual Socket CPU Server 1x P100 2x P100 4x P100 DGX-1 (8x P100) AlexNet training batch size 128, Grad Bit = 32, Dual socket E5-2699v4 CPUs (total 44 cores) CNTK 2.0b3 (to be released) includes cudnn 5.1.8, NCCL 1.6.1, NVLink enabled

25 Superior performance

26 CNTK Scalability ResNet50 on DGX-1, tested by Nvidia, Dec. 2016

27 CNTK Other Advantages Python and C++ API Mostly implemented in C++ Low level + high level Python API Extensibility User functions and learners in pure Python Readers Distributed, highly efficient built-in data readers Internal == External

28 CNTK Other Advantages Keras interoperability With CNTK backend Try models trained with TF/Theano with CNTK (vice versa) Demo RL with DQN Binary evaluation 10x speed-up in model execution

29 Reinforcement Learning with Flappy Bird

30 Binary evaluation Conv-Net Binary Conv-Net

31 Outline Overview CNTK introduction Symbolic loop Batch scheduling Data parallel training Educational resources Conclusions

32 What is CNTK? CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.

33 What is CNTK? Example: 2-hidden layer feed-forward NN h 1 = s(w 1 x + b 1 ) h1 = sigmoid W1 + b1) h 2 = s(w 2 h 1 + b 2 ) h2 = sigmoid W2 + b2) P = softmax(w out h 2 + b out ) P = softmax Wout + bout) with input x R M and one-hot label L R M and cross-entropy training criterion ce = L T log P ce = cross_entropy (L, P) = max Scorpus ce

34 What is CNTK? Example: 2-hidden layer feed-forward NN h 1 = s(w 1 x + b 1 ) h1 = sigmoid W1 + b1) h 2 = s(w 2 h 1 + b 2 ) h2 = sigmoid W2 + b2) P = softmax(w out h 2 + b out ) P = softmax Wout + bout) with input x R M and one-hot label y R J and cross-entropy training criterion ce = y T log P ce = cross_entropy (L, P) = max Scorpus ce

35 What is CNTK? example: 2-hidden layer feed-forward NN h 1 = s(w 1 x + b 1 ) h1 = sigmoid W1 + b1) h 2 = s(w 2 h 1 + b 2 ) h2 = sigmoid W2 + b2) P = softmax(w out h 2 + b out ) P = softmax Wout + bout) with input x R M and one-hot label y R J and cross-entropy training criterion ce = y T log P ce = cross_entropy (P, y) = max Scorpus ce

36 What is CNTK? h1 = sigmoid W1 + b1) h2 = sigmoid W2 + b2) P = softmax Wout + bout) ce = cross_entropy (P, y)

37 What is CNTK? b out W out b 2 W 2 ce cross_entropy P softmax + h 2 s + h 1 s h1 = sigmoid W1 + b1) h2 = sigmoid W2 + b2) P = softmax Wout + bout) ce = cross_entropy (P, y) b 1 W 1 + x y

38 What is CNTK? b out W out b 2 W 2 ce cross_entropy P softmax + h 2 s + h 1 s Nodes: functions (primitives) Can be composed into reusable composites Edges: values Incl. tensors, sparse Automatic differentiation F / in = F / out out / in Deferred computation execution engine Editable, clonable b 1 W 1 + x y LEGO-like composability allows CNTK to support wide range of networks & applications

39 CNTK Unique Features Symbolic loops over sequences with dynamic scheduling Turn graph into parallel program through minibatching Unique parallel training algorithms (1-bit SGD, Block Momentum)

40 Symbolic Loops over Sequential Data Extend our example to a recurrent network (RNN) h 1 (t) = s(w 1 x(t) + H 1 h 1 (t-1) + b 1 ) h1 = W1 + past_value(h1) + b1) h 2 (t) = s(w 2 h 1 (t) + H 2 h 2 (t-1) + b 2 ) h2 = W2 + H2 + b2) P(t) = softmax(w out h 2 (t) + b out ) P = Wout + bout) ce(t) = L T (t) log P(t) ce = cross_entropy(p, L) = max Scorpus ce(t) no explicit notion of time

41 Symbolic Loops over Sequential Data Extend our example to a recurrent network (RNN) h 1 (t) = s(w 1 x(t) + R 1 h 1 (t-1) + b 1 ) h1 = W1 + R1 + b1) h 2 (t) = s(w 2 h 1 (t) + R 2 h 2 (t-1) + b 2 ) h2 = W2 + R2 + b2) P(t) = softmax(w out h 2 (t) + b out ) P = Wout + bout) ce(t) = L T (t) log P(t) ce = cross_entropy(p, L) = max Scorpus ce(t)

42 Symbolic Loops over Sequential Data Extend our example to a recurrent network (RNN) h 1 (t) = s(w 1 x(t) + R 1 h 1 (t-1) + b 1 ) h1 = W1 + R1 + b1) h 2 (t) = s(w 2 h 1 (t) + R 2 h 2 (t-1) + b 2 ) h2 = W2 + R2 + b2) P(t) = softmax(w out h 2 (t) + b out ) P = Wout + bout) ce(t) = L T (t) log P(t) ce = cross_entropy(p, L) = max Scorpus ce(t) compare to id (Irvine Dataflow): [Arvind et al., TR114a, Dept ISC, UC Irvine, Dec 1978; Executing a Program on the MIT Tagged-Token Dataflow Architecture, def ip(a: Sequence[tensor], b: Sequence[tensor]): s0 = 0 s_ = ForwardDeclaration() s = past_value(s_, initial_value=s0) + a * b s_.resolve_to(s) s = last(s) return s

43 Symbolic Loops over Sequential Data b out W out b 2 W 2 b 1 W 1 ce cross_entropy P softmax + h 2 s + h 1 s + + x z -1 + R 2 z -1 R 1 y h1 = W1 + R1 + b1) h2 = W2 + R2 + b2) P = Wout + bout) ce = cross_entropy(p, L) CNTK automatically unrolls cycles at execution time cycles are detected with Tarjan s algorithm only nodes in cycles efficient and composable cf. TensorFlow: [ lstm = rnn_cell.basiclstmcell(lstm_size) state = tf.zeros([batch_size, lstm.state_size] probabilities = [] loss = 0.0 for current_batch_of_words in words_in_dataset: output, state = lstm(current_batch_of_words, state) logits = tf.matmul(output, softmax_w) + softmax_b probabilities.append(tf.nn.softmax(logits)) loss += loss_function(probabilities, target_words)

44 parallel sequences Batch-Scheduling of Variable-Length Sequences Minibatches containing sequences of different lengths are automatically packed and padded time steps computed in parallel sequence 1 sequence 2 sequence 3 sequence 3 sequence 7 padding sequence 5 sequence 6 CNTK handles the special cases: past_value operation correctly resets state and gradient at sequence boundaries non-recurrent operations just pretend there is no padding ( garbage-in/garbage-out ) sequence reductions

45 parallel sequences Batch-Scheduling of Variable-Length Sequences Minibatches containing sequences of different lengths are automatically packed and padded time steps computed in parallel sequence 1 sequence 2 sequence 3 padding schedule into the same slot, it may come for free! sequence 4 sequence 7 sequence 5 sequence 6 CNTK handles the special cases: past_value operation correctly resets state and gradient at sequence boundaries non-recurrent operations just pretend there is no padding ( garbage-in/garbage-out ) sequence reductions

46 parallel sequences Batch-Scheduling of Variable-Length Sequences Minibatches containing sequences of different lengths are automatically packed and padded time steps computed in parallel sequence 1 sequence 2 sequence 3 sequence 4 sequence 7 padding Fully transparent batching Recurrent CNTK unrolls, handles sequence boundaries Non-recurrent operations parallel Sequence reductions mask sequence 5 sequence 6

47 Data-Parallel Training Degrees of parallelism: Within-vector parallelization: vectorized Across independent samples: batching Across GPUs: async PCIe device-to-device transfers Across servers: MPI etc., NVidia NCCL Parallelization options: Data-parallel Model-parallel Layer-parallel

48 Data-Parallel Training Data-parallelism: distribute minibatch over workers, all-reduce partial gradients node 1 node 2 node 3 S all-reduce

49 Data-Parallel Training Data-parallelism: distribute minibatch over workers, all-reduce partial gradients node 1 node 2 node 3

50 Data-Parallel Training Data-parallelism: distribute minibatch over workers, all-reduce partial gradients node 1 node 2 node 3 ring algorithm O(2 (K-1)/K M) O(1) w.r.t. K

51 Data-Parallel Training Data-parallelism: distribute minibatch over workers, all-reduce partial gradients O(1) enough? Example: DNN, MB size 1024, 160M model parameters compute per MB: 1/7 second communication per MB: 1/9 second (640M over 6 GB/s) can t even parallelize to 2 GPUs: communication cost already dominates! How about doing it asynchronously? HogWild! [-], DistBelief ASGD [Dean et al., 2012] Helps with latency and jitter, could hide some communication cost with pipeline Does not change the problem fundamentally

52 Data-Parallel Training How to reduce communication cost: communicate less each time 1-bit SGD: [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: 1-Bit Stochastic Gradient Descent... Distributed Training of Speech DNNs, Interspeech 2014] Quantize gradients to 1 bit per value Trick: carry over quantization error to next minibatch 1-bit quantized with residual node 1 node 2 node 3 1-bit quantized with residual

53 Data-Parallel Training How to reduce communication cost: communicate less each time 1-bit SGD: [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: 1-Bit Stochastic Gradient Descent...Distributed Training of Speech DNNs, Interspeech 2014] quantize gradients to 1 bit per value trick: carry over quantization error to next minibatch communicate less often Automatic MB sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: ON Parallelizability of Stochastic Gradient Descent..., ICASSP 2014] Block momentum [K. Chen, Q. Huo: Scalable training of deep learning machines by incremental block training, ICASSP 2016] Very recent, very effective parallelization method Combines model averaging with error-residual idea

54 Outline Overview CNTK introduction Symbolic loop Batch scheduling Data parallel training Educational resources Conclusions

55 Where to begin? Tutorials:

57 Where to begin? Azure Notebooks: Try for free pre-hosted

61 Outline Overview CNTK introduction Symbolic loop Batch scheduling Data parallel training Educational resources Conclusions

62 Conclusions CNTK: fast, scalable, and extensible CNTK Unique features: Symbolic loop Batch scheduling Data parallel training Educational resources Azure Notebooks with CNTK (at no cost) EdX Deep Learning Explained course

63 Thanks!

66 Power point-plug in for Translating

67 Data Partition Partition randomly training dataset D into S minibatches D = {B i i = 1,2,, S} Group every τ mini-batches to form a split Group every N splits to form a data block Training dataset D consists of M data blocks S = M N τ Training dataset is processed block-by-block Incremental Block Training (IBT)

68 Intra-Block Parallel Optimization (IBPO) Select randomly an unprocessed data block denoted as D t Distribute N splits of D t to N parallel workers Starting from an initial model denoted as W init (t), each worker optimizes its local model independently by 1-sweep mini-batch SGD with momentum trick Average N optimized local models to get W(t)

69 Blockwise Model-Update Filtering (BMUF) Generate model-update resulting from data block D t : G t = W t W init t Calculate global model-update: Δ t = η t Δ t 1 + ς t G t ς t : Block Learning Rate (BLR) η t : Block Momentum (BM) When ς t = 1 and η t = 0 MA Update global model W t = W t 1 + Δ t Generate initial model for next data block Classical Block Momentum (CBM) W init t + 1 = W t Nesterov Block Momentum (NBM) W init t + 1 = W t + η t+1 Δ t

70 Iteration Repeat IBPO and BMUF until all data blocks are processed So-called one sweep Re-partition training set for a new sweep, repeat the above step Repeat the above step until a stopping criterion is satisfied Obtain the final global model W final

71 WER(%) WER(%) Benchmark Result of Parallel Training on CNTK Training data: 2,670-hour speech from real traffics of VS, SMD, and Cortana About 16 and 20 days to train DNN and LSTM on 1-GPU, respectively bit-average 1bit-peak BMUF-average BMUF-peak 1bit/BMUF Speedup Factors in LSTM Training GPUs 8 GPUs 16 GPUs 32 GPUs 64 GPUs Credit: Yongqiang Wang, Kai Chen, Qiang Huo WER of CE-trained DNN with different # of GPUs bit BMUF # of GPUs WER of CE-trained LSTM with different # of GPUs bit 10.6 BMUF # of GPUs

72 Results Achievement Almost linear speedup without degradation of model quality Verified for training DNN, CNN, LSTM up to 64 GPUs for speech recognition, image classification, OCR, and click prediction tasks Released in CNTK as a critical differentiator Used for enterprise scale production data loads Production tools in other companies such as iflytek and Alibaba

CNTK Microsoft s Open Source Deep Learning Toolkit. Taifeng Wang Lead Researcher, Microsoft Research Asia 2016 GTC China

CNTK Microsoft s Open Source Deep Learning Toolkit Taifeng Wang Lead Researcher, Microsoft Research Asia 2016 GTC China Deep learning in Microsoft Cognitive Services https://how-old.net http://www.captionbot.ai