DMTK: An Ultra-Large-Scale Deep Learning Framework and Its Application to Text Understanding
1 DMTK: An Ultra-Large-Scale Deep Learning Framework and Its Application to Text Understanding. Taifeng Wang, Lead Researcher, Machine Learning Group, MSRA. GTC China, 9/13/2016.
2 Microsoft Research Lab Locations
- Redmond, Washington, USA (Sep 1991)
- Cambridge, Massachusetts, USA (2008)
- New York, USA (May 2012)
- Cambridge, UK (July 1997)
- Beijing, China (Nov 1998)
- Bangalore, India (Jan 2005)
3 Microsoft Research Asia 1.0: technologies transferred into all major Microsoft products.
4 About DMTK (Distributed Machine Learning Toolkit). We focus on providing distributed machine learning infrastructure and algorithms to handle big-data and big-model learning tasks. Released on GitHub by the Machine Learning Group of MSRA.
5 DMTK User Engagement. Within just one week after release: stars and 200+ forks on GitHub; 1M+ visits to the DMTK homepage; 300K+ downloads of binary executables. Major upgrade in September 2016.
6 DMTK Development History
Parameter Server 1.0:
- Distributed LightLDA
- Distributed Word2Vec
- Richer programming-language support for the parameter server (Python, Lua)
- Connections to Torch/Caffe/Theano
- Innovations in distributed optimization: DC-ASGD, ensemble models, accelerated optimization methods
- Model parallelism for deep learning models
- 2C-RNN for text understanding
- Graph embedding
Parameter Server 2.0:
- Simpler SDK usage
- System performance enhancements, e.g. reduced memory and network cost
- Deep integration with CNTK
- Distributed logistic regression with online updates (FTRL)
- Distributed gradient boosting decision trees (GBDT)
7 Microsoft Distributed Machine Learning Toolkit (DMTK)
- Distributed machine learning algorithms: 2C-RNN, Logistic Regression, LightGBM, LightLDA, Distributed Word Embedding
- Parallelizes different machine learning toolkits: AzureML, CNTK, other single-node DNN tools (Theano/Caffe/Torch)
- Multiverso Parameter Server:
  - Distributed synchronization mechanisms: ASGD / DC-ASGD, MA / ADMM / BMUF
  - Hybrid model store with model slicing: matrix/tensor, hash table, tree
  - Rich communication interfaces: MPI, ZeroMQ, RDMA, GPU Direct
- Execution engines: YARN
8 Workloads Supported

| Workload | Model | Data | Training time |
| --- | --- | --- | --- |
| LightLDA | 20M vocab, 1M topics (largest topic model) | 200B tokens (Bing web chunk) | 60 hrs on 24 machines (nearly linear speed-up) |
| Word2Vec | 10M vocab, 1000 dims (largest word embedding) | 200B samples (Bing web chunk) | 40 hrs on 8 machines (nearly linear speed-up) |
| GBDT | 3000 trees (120 nodes) | 7M records (Bing HRS data) | 3 hrs on 8 machines (4x speed-up) |
| LSTM | 20M parameters (4 hidden layers) | 1570 hrs of speech data (Windows Phone data) | 1 day on 16 GPUs (15.9x speed-up) |
| CNN | 41M parameters (GoogLeNet) | 2M images (ImageNet 1K dataset) | 30 hrs on 16 GPUs (10x speed-up) |
| Online FTRL | 800M parameters (logistic regression) | 6.4B impressions (Bing Ads click log) | 2400 s on 24 machines (12x speed-up) |
9 Forward Looking: Microsoft Cognitive Toolkit (CNTK)
10 How to Advance Large-Scale Machine Learning
System innovation:
- Leverage the full power of distributed systems and pursue near-linear scale-out/speed-up.
- New distributed training paradigms need to be invented to resolve the bottlenecks of existing distributed machine learning systems.
Algorithmic innovation:
- Machine learning algorithms themselves need sufficiently high efficiency and throughput.
- Existing designs/implementations of machine learning algorithms may not have considered this requirement; redesign/re-implementation may be needed.
11 Evolution of Distributed ML (chart): along the synchronization axis, systems move from synchronous Iterative MapReduce (LDA, LR) through the Parameter Server (deep learning, LDA, GBDT, LR) to asynchronous Dataflow (deep learning); along the parallelism axis, from data parallelism through model parallelism to irregular parallelism.
12 Evolution of Distributed Machine Learning: Iterative MapReduce
- Local computation, synchronous updates
- Uses MapReduce / AllReduce to synchronize parameters among workers
- Supports only synchronous updates
- Example: Spark and other derived systems
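To make the synchronous pattern concrete, here is a minimal sketch of one data-parallel SGD step using MPI AllReduce. This is illustrative code, not DMTK's; the least-squares gradient is just a stand-in for any model's gradient.

```python
import numpy as np
from mpi4py import MPI  # assumes an MPI environment

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()

def local_gradient(w, batch):
    X, y = batch
    return X.T @ (X @ w - y) / len(y)  # illustrative least-squares gradient

def sync_sgd_step(w, batch, lr=0.1):
    g_local = local_gradient(w, batch)
    g_global = np.empty_like(g_local)
    comm.Allreduce(g_local, g_global, op=MPI.SUM)  # barrier: all workers wait here
    return w - lr * g_global / n_workers           # apply the averaged gradient
```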
13 Evolution of Distributed Machine Learning: Iterative MapReduce → Parameter Server
Parameter-server (PS) based solutions were proposed to support:
- Asynchronous updates
- Different mechanisms for model aggregation, especially in the asynchronous setting
- Model parallelism
Examples: DistBelief (Google, NIPS 2012); Petuum (Eric Xing, NIPS 2013); Parameter Server (Mu Li, OSDI 2014); Multiverso PS.
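The asynchronous pull/push cycle can be sketched as follows; this is a toy single-process illustration of the pattern, not the Multiverso API.

```python
import numpy as np

class ParameterServer:
    """Toy PS: workers pull a snapshot and push gradients at their own pace."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()      # snapshot may be stale by the time it is used

    def push(self, grad):
        self.w -= self.lr * grad  # applied whenever a worker finishes, no barrier

def worker_step(server, batch, local_gradient):
    w_snapshot = server.pull()    # no synchronization with other workers
    server.push(local_gradient(w_snapshot, batch))
```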
14 Evolution of Distributed Machine Learning: Iterative MapReduce → Parameter Server → Dataflow
Dataflow engines schedule and execute tasks based on (1) data dependencies and (2) resource availability. Dataflow-based solutions were proposed to support:
- Irregular parallelism (e.g. hybrid data- and model-parallelism), particularly in deep learning
- Both high-level abstraction and low-level flexibility in implementation
Examples: TensorFlow (Google); Dryad (Microsoft, EuroSys 2007); Spark (AMPLab, NSDI 2012).
15 Delay-Compensated ASGD: our work on system innovation.
16 Delayed Gradients

Sequential SGD: $w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_{t+\tau})$

Async SGD: $w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_t)$

Taylor expansion of the delayed gradient:
$g(w_{t+\tau}) = g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) + O(\lVert w_{t+\tau} - w_t \rVert^2)$,
where $\nabla g(w_t)$ corresponds to the Hessian matrix.
17 Unbiased, Efficient Approximation of the Hessian Matrix

Theorem: Assume $Y$ is a discrete random variable with $P(Y = k \mid X = x, w) = \sigma_k(x; w)$, where $0 < \sigma_k(x; w) < 1$ for all $x$, $w$, and $k = 1, \dots, K$. Let $L(x, y, w) = -\sum_k I_{y=k} \log \sigma_k(x; w)$. Then there exists a function $\phi$ such that

$\mathbb{E}_{Y \mid x, w}\!\left[ \frac{\partial^2}{\partial w^2} L(X, Y; w) \right] = \mathbb{E}_{Y \mid x, w}\!\left[ \phi\!\left( \frac{\partial}{\partial w} L(X, Y; w) \right) \right].$

For the cross-entropy loss, second-order derivatives can thus be derived from first-order derivatives in an unbiased manner.
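The identity behind the theorem is the standard information-matrix equality; sketching it in the slide's notation shows why an element-wise choice such as $\phi(g) = g \odot g$ is natural (an assumption consistent with the DC-ASGD line of work, not stated explicitly on this slide):

```latex
% For a correctly specified likelihood, the expected Hessian of the negative
% log-likelihood equals the Fisher information, which equals the expected
% outer product of the score (the gradient):
\mathbb{E}_{Y \mid x, w}\!\left[ \nabla_w^2 L \right]
  = \mathbb{E}_{Y \mid x, w}\!\left[ \nabla_w L \, (\nabla_w L)^{\top} \right]
% Taking the diagonal: the element-wise square of the gradient is an unbiased
% estimator of the Hessian diagonal, i.e. \phi(g) = g \odot g.
```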
18 Delay-Compensated ASGD (DC-ASGD)

ASGD: $w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_t)$

DC-ASGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \left( g(w_t) + \lambda\, \phi(g(w_t)) \odot (w_{t+\tau} - w_t) \right)$
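A minimal sketch of the compensated update, assuming the element-wise choice $\phi(g) = g \odot g$ from above; the hyper-parameter value is arbitrary and this is not the Multiverso implementation.

```python
import numpy as np

def dc_asgd_update(w_now, w_backup, grad, lr=0.1, lam=0.04):
    """One DC-ASGD server update for a delayed gradient.

    w_now    -- current global model w_{t+tau}
    w_backup -- snapshot w_t against which the worker computed grad
    grad     -- delayed gradient g(w_t)
    lam      -- compensation strength lambda (assumed hyper-parameter)
    """
    compensation = lam * grad * grad * (w_now - w_backup)  # phi(g) = g * g
    return w_now - lr * (grad + compensation)
```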
19 Experimental Results (based on ResNet): convergence panels on CIFAR and ImageNet.
20 2C-RNN: A Super-Efficient and Scalable Deep Algorithm for Text Understanding. Our work on algorithm innovation, published at NIPS 2016.
21 Recurrent Neural Networks for Text Applications
A widely used model for sequence representation and learning:
- Language modeling
- Machine translation
- Conversation bots
- Image/video captioning
Major challenges: efficiency and scalability.
22 Language Modeling

$h_t = \sigma(U x_t + W h_{t-1} + b)$
$o_t = V h_t$
$y_t = \mathrm{softmax}(o_t)$

| Symbol | Definition | Dimension |
| --- | --- | --- |
| $x_t$ | (Input) embedding vector of the word at position $t$ | $w$ |
| $U$ | Parameter matrix: input → hidden state | $h \times w$ |
| $W$ | Parameter matrix: hidden state → hidden state | $h \times h$ |
| $V$ | Output embedding matrix: hidden state → output | $|V| \times h$ |
| $y_t$ | Predicted probability for each word | $|V|$ |
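A minimal numpy sketch of one forward step of this model, with toy sizes (the sigmoid stands in for the slide's $\sigma$; real systems would use an LSTM cell and batching):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, w_dim, h_dim = 10_000, 1024, 1024   # toy vocabulary for illustration

X = rng.normal(0, 0.01, (V_size, w_dim))    # input word embeddings
U = rng.normal(0, 0.01, (h_dim, w_dim))     # input -> hidden
W = rng.normal(0, 0.01, (h_dim, h_dim))     # hidden -> hidden
V = rng.normal(0, 0.01, (V_size, h_dim))    # hidden -> output: the 40 GB culprit at 10M vocab
b = np.zeros(h_dim)

def step(word_id, h_prev):
    x_t = X[word_id]
    h_t = 1.0 / (1.0 + np.exp(-(U @ x_t + W @ h_prev + b)))  # h_t = sigma(...)
    o_t = V @ h_t                                            # |V| logits: the costly part
    y_t = np.exp(o_t - o_t.max())
    return h_t, y_t / y_t.sum()                              # softmax

h, y = step(42, np.zeros(h_dim))
```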
23 Challenge in Text Applications: Model Size

Large-scale data:
| Dataset | #tokens | Vocab |
| --- | --- | --- |
| ClueWeb09 (en) | 143,820,387,816 | 10,784,180 |

The resulting model is too large for current GPUs:
| Symbol | Definition | Dimension | Memory size |
| --- | --- | --- | --- |
| $X$ | (Input) embedding vectors of the words | $|V| \times w$ | 10M × 1024 × 4B = 40 GB |
| $U$ | Parameter matrix: input → hidden state | $h \times w$ | 1024 × 1024 × 4B = 4 MB |
| $W$ | Parameter matrix: hidden state → hidden state | $h \times h$ | 1024 × 1024 × 4B = 4 MB |
| $V$ | Output embedding matrix: hidden state → output | $|V| \times h$ | 10M × 1024 × 4B = 40 GB |
| $y_t$ | Predicted probability for each word | $|V|$ | 10M × 4B = 40 MB |
24 Challenge in Text Applications: Running Time

Same large-scale data: ClueWeb09 (en), 143,820,387,816 tokens, 10,784,180-word vocabulary.

Huge computational complexity: to choose one word, we must score every word in the vocabulary.

| Quantity | #operations per token | Unit |
| --- | --- | --- |
| $h_t$ | 2 million | float operations |
| $o_t$ | 10 billion | float operations |

where $h_t = \sigma(U x_t + W h_{t-1} + b)$, $o_t = V h_t$, $y_t = \mathrm{softmax}(o_t)$.
25 Training-Time Estimation on Mainstream Hardware

Dataset: ClueWeb09 (en), 143,820,387,816 tokens, 10,784,180-word vocabulary. The estimate is #tokens × #operations per token × #epochs × 2 (forward and backward propagation) / device FLOPS.

| Device | Computation (#cores, FLOPS) | Global memory cap./BW | Estimated running time |
| --- | --- | --- | --- |
| Xeon Broadwell (14 nm) | 2 × 20 cores, ~0.736 TFLOPS | 8 × 32 GB DDR4 (256 GB) / 95 GBps | 0.143T × 10G × 10 × 2 / 0.736T / 3600 / 24 / 365 ≈ 1232 years |
| GPU K40 (28 nm) | 5.0 TFLOPS (float32), 2880 cores | 12 GB GDDR5 / 288 GBps | 0.143T × 10G × 10 × 2 / 5T / 3600 / 24 / 365 ≈ 181 years |
| GPU M40 (28 nm) | 6.8 TFLOPS (float32), 3072 cores | 24 GB GDDR5 / 288 GBps | 0.143T × 10G × 10 × 2 / 6.8T / 3600 / 24 / 365 ≈ 133 years |
| GPU P100 (16 nm) | 10.6 TFLOPS (float32), 21.2 TFLOPS (float16) | 16 GB HBM2 / 720 GBps | 0.143T × 10G × 10 × 2 / 10.6T / 3600 / 24 / 365 ≈ 85 years |
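The back-of-the-envelope arithmetic behind the table, as a short script (device peak FLOPS as listed above):

```python
TOKENS = 0.143e12       # ~143.8B tokens, ClueWeb09 (en)
OPS_PER_TOKEN = 10e9    # dominated by o_t = V h_t
EPOCHS = 10
FWD_BWD = 2             # forward and backward propagation

def years(peak_flops):
    seconds = TOKENS * OPS_PER_TOKEN * EPOCHS * FWD_BWD / peak_flops
    return seconds / 3600 / 24 / 365

for name, flops in [("Broadwell", 0.736e12), ("K40", 5.0e12),
                    ("M40", 6.8e12), ("P100 fp32", 10.6e12)]:
    print(f"{name}: {years(flops):.0f} years")
# -> about 1232, 181, 133, 85 years, matching the table
```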
26 A Big Challenge for Algorithm Innovation and Hardware Manufacturing

The key problem is the huge vocabulary: the input embedding table $X$ and the output embedding matrix $V$ each scale as $|V| \times h$, i.e. 10M × 1024 × 4B = 40 GB apiece (same memory breakdown as slide 23).
27 Our Proposal: 2-Component Shared Embedding (accepted by NIPS 2016)

Current practice: each word has its own embedding vector, so the vocabulary needs $|V|$ vectors (January, February, one, two each get one).

Our approach: arrange the vocabulary in a 2D table, so each word is represented by two vectors $(x, y)$, e.g. January → $(x_1, y_1)$, February → $(x_1, y_2)$, one → $(x_2, y_1)$, two → $(x_2, y_2)$.
- 2C: each word is partitioned and represented by two vectors $(x, y)$.
- Shared embedding: $x$ is shared by all words in the same row; $y$ is shared by all words in the same column.
- Only $2\sqrt{|V|}$ vectors are needed instead of $|V|$.
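A minimal sketch of the shared-embedding lookup, assuming words have already been allocated to cells of the table (the modular toy allocation here is an assumption; the real allocation is the topic of slide 31):

```python
import numpy as np

vocab = 10_000                        # toy vocabulary size
side = int(np.ceil(np.sqrt(vocab)))  # table is side x side -> 2*side vectors total
dim = 1024

row_emb = np.random.randn(side, dim) * 0.01  # x vectors, shared across a row
col_emb = np.random.randn(side, dim) * 0.01  # y vectors, shared across a column

def cell(word_id):
    """Toy allocation: word id -> (row, column) position in the table."""
    return word_id // side, word_id % side

def embed(word_id):
    r, c = cell(word_id)
    return row_emb[r], col_emb[c]    # the (x, y) pair representing the word

x, y = embed(4242)
```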
28 2C-RNN (unrolled diagram): the network interleaves the row and column components of each word. The previous word's row vector $x^r$ and column vector $x^c$ feed the recurrent state in turn (through $U$ and $W$); at each position the model outputs a row distribution $P_r(w_t)$ and a column distribution $P_c(w_t)$ (through $Y_r$, $Y_c$), and the word probability is $P(w_t) = P_r(w_t) \cdot P_c(w_t)$, in contrast to a standard RNN that predicts $P(w_t)$ directly.
29 Analysis of Model Size

With the same equations ($h_t = \sigma(U x_t + W h_{t-1} + b)$, $o_t = V h_t$, $y_t = \mathrm{softmax}(o_t)$) applied per component, total model size drops from 80 GB to 70 MB:

| Symbol | Definition | Dimension | Memory size |
| --- | --- | --- | --- |
| $x, y$ | (Input) embedding vectors of the word at position $t$ | $2\sqrt{|V|} \times w$ | 2 × (10M)^{1/2} × 1024 × 4B ≈ 25 MB |
| $U_x, U_y$ | Parameter matrices: input → hidden state | $2\,(h \times w)$ | 2 × 1024 × 1024 × 4B = 8 MB |
| $W_x, W_y$ | Parameter matrices: hidden state → hidden state | $2\,(h \times h)$ | 2 × 1024 × 1024 × 4B = 8 MB |
| $V_x, V_y$ | Output embedding matrices: hidden state → output | $2\sqrt{|V|} \times h$ | 2 × (10M)^{1/2} × 1024 × 4B ≈ 25 MB |
| $y_t$ | Predicted probability for each word | $2\sqrt{|V|}$ | < 1 MB |
30 Analysis of Computational Complexity

Per-token output cost drops from 10G to about 10M float operations:

| Quantity | #operations per token | Unit |
| --- | --- | --- |
| $h_t^x, h_t^y$ | 4M | float operations |
| $o_t^x, o_t^y$ | 2 × √(10M) × 1024 ≈ 6M | float operations |

Training-time estimate on a K40 (28 nm, 5.0 TFLOPS float32): 0.143T × 10M × 10 × 2 / 5T / 3600 / 24 / 365 ≈ 0.18 years, down from 181 years. And we can now easily parallelize with the parameter-server framework: with 20 machines, training takes 2-3 days.
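A sketch of the factorized output layer that realizes the saving: compute $\sqrt{|V|}$ row logits plus $\sqrt{|V|}$ column logits instead of $|V|$ logits. In the full 2C-RNN the row and column predictions use successive hidden states, as in the slide-28 diagram; this sketch collapses them into one state for brevity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_probability(h, V_row, V_col, r, c):
    """P(word) = P_row(r) * P_col(c).

    V_row, V_col -- output matrices of shape (sqrt(|V|), h)
    r, c         -- the word's (row, column) cell in the table
    """
    p_row = softmax(V_row @ h)   # sqrt(|V|) * h operations ...
    p_col = softmax(V_col @ h)   # ... instead of |V| * h
    return p_row[r] * p_col[c]
```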
31 How to Allocate Words into the 2D Table

Cold start:
- Partition rows according to prefix: e.g. react, real, return (re*****) share a row.
- Partition columns according to suffix: e.g. Billion, Million, Trillion (****llion) share a column; sure, prepare (****re) share a column.

Bootstrap:
- Train with the current partition for several iterations.
- Adjust the partition based on training loss.
- Continue training.
32 Experimental Results: Medium-Sized Datasets (2013 ACL Workshop)

Test perplexity on ACLW-Spanish and ACLW-French, and model size, compared across KN-4 [1], MLBL [1], LSTM word-in/word-out, LSTM char-CNN-in/word-out [2], our 2C-RNN [cold start], and our 2C-RNN [bootstrap]. (Numeric values missing from this transcript.)

Time to reach the same PPL as the HSM baseline (1 GPU): runtime in hours for HSM vs. 2C-RNN, with the 2C-RNN time split into reallocation and training percentages.

[1] Non-LSTM RNN baselines. [2] Previous state-of-the-art method using character-CNN input: Character-Aware Neural Language Models.
33 Experimental Results: Large-Scale Dataset (One Billion Word benchmark)

Test perplexity and model size compared across KN-5 [1] (PPL 68, ~2 GB), HSM [2] (GB-scale), Blackout-RNN [3] (GB-scale), our 2C-RNN [cold start] (MB-scale), our 2C-RNN [bootstrap] (MB-scale), and the interpolations KN + HSM [2], KN + Blackout-RNN [3], and KN + 2C-RNN. (Most numeric values missing from this transcript.)

Time to reach the same PPL as the HSM baseline (1 GPU): runtime in hours for HSM vs. 2C-RNN, with the 2C-RNN time split into reallocation and training percentages.

[1] One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. [2] Strategies for Training Large Vocabulary Neural Language Models. [3] BlackOut: Speeding Up Recurrent Neural Network Language Models with Very Large Vocabularies.
34 Summary and Forward Looking

DMTK includes innovations in both systems and algorithms:
- Excellent speed-up and widely available system integration
- Advanced distributed optimization methods
- Many world-leading algorithms

Machine learning for distributed deep learning:
- Learning how to acquire, select, and partition the data
- Learning the optimal network structure
- Learning how to perform model updates
- Learning how to tune the hyper-parameters
- Learning how to aggregate local models

Create an AI that can automatically create new AI!
35 DMTK Materials. You are welcome to join our WeChat group, 分布式机器学习联盟 (Distributed Machine Learning Alliance), to discuss big data and AI together. Thanks!
36 (Appendix) Bootstrap: Bipartite Graph Matching
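The reallocation step can be read as an assignment problem: given the training loss each word would incur in each table cell, re-assign words to cells to minimize total loss. A hedged sketch using a generic min-cost bipartite matching solver (scipy's linear_sum_assignment); the paper's exact cost formulation and solver may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def reallocate(loss):
    """loss[i, j]: training loss if word i is placed in (flattened) cell j."""
    words, cells = linear_sum_assignment(loss)  # min-cost bipartite matching
    return dict(zip(words, cells))              # word id -> new cell

loss = np.random.rand(16, 16)                   # toy: 16 words, 4x4 table
new_allocation = reallocate(loss)
```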
37 Comparison (class-based softmax diagram: the vocabulary is grouped into classes $c_1, \dots, c_k$, each containing words $w_{i,1}, \dots, w_{i,k}$)

| Method | Model size | Training time | Test time | Generation time |
| --- | --- | --- | --- | --- |
| Standard softmax | O(|V|·w) | O(|V|·w) | O(|V|·w) | O(|V|·w) |
| Tree-based softmax | O(|V|·w) | O(log|V|·w) | O(log|V|·w) | O(|V|·w) |
| Class-based softmax | O(|V|·w) | O(√|V|·w) | O(√|V|·w) | O(|V|·w) |
| Our 2C | O(√|V|·w) | O(√|V|·w) | O(√|V|·w) | O(√|V|·w) |