DMTK: An Ultra-Large-Scale Deep Learning Framework and Its Application to Text Understanding
1 DMTK: An Ultra-Large-Scale Deep Learning Framework and Its Application to Text Understanding. Taifeng Wang, Lead Researcher, Machine Learning Group, MSRA. GTC China, 9/13/2016.
2 Microsoft Research Lab Locations
- Redmond, Washington, USA (Sep 1991)
- Cambridge, Massachusetts, USA (2008)
- New York, USA (May 2012)
- Cambridge, UK (July 1997)
- Beijing, China (Nov 1998)
- Bangalore, India (Jan 2005)
3 Microsoft Research Asia 1.0: technologies transferred into all major Microsoft products.
4 About DMTK (Distributed Machine Learning Toolkit). We focus on providing distributed machine learning infrastructure and algorithms to handle big-data and big-model learning tasks. Released on GitHub by the Machine Learning Group of MSRA.
5 DMTK User Engagement. Within just one week after release: stars and 200+ forks on GitHub; 1M+ visits to the DMTK homepage; 300K+ downloads of binary executables. Major upgrade in September 2016.
6 DMTK Development History
Parameter Server 1.0:
- Distributed LightLDA
- Distributed Word2Vec
- Richer programming-language support for the parameter server (Python, Lua)
- Connections to Torch/Caffe/Theano
- Innovations in distributed optimization: DC-ASGD, ensemble models, accelerated optimization methods
- Model parallelism for deep learning models
- 2C-RNN for text understanding
- Graph embedding
Parameter Server 2.0:
- Simpler SDK usage
- System performance enhancements, e.g. reduced memory and network cost
- Deep integration with CNTK
- Distributed logistic regression with online updates (FTRL)
- Distributed gradient boosting decision trees (GBDT)
7 Microsoft Distributed Machine Learning Toolkit (DMTK)
- Distributed machine learning algorithms: 2C-RNN, Logistic Regression, LightGBM, LightLDA, Distributed Word Embedding
- Parallelizes different machine learning toolkits: AzureML, CNTK, other single-node DNN tools (Theano/Caffe/Torch)
- Multiverso Parameter Server:
  - Distributed synchronization mechanisms: ASGD / DC-ASGD, MA / ADMM / BMUF
  - Hybrid model store with model slicing: matrix/tensor, hash table, tree
  - Rich communication interfaces: MPI, ZeroMQ, RDMA, GPU Direct
- Execution engines: YARN
8 Workloads Supported

| Workload | Model | Data | Training time |
| --- | --- | --- | --- |
| LightLDA | 20M vocab, 1M topics (largest topic model) | 200B tokens (Bing web chunk) | 60 hrs on 24 machines (nearly linear speed-up) |
| Word2Vec | 10M vocab, 1000 dims (largest word embedding) | 200B samples (Bing web chunk) | 40 hrs on 8 machines (nearly linear speed-up) |
| GBDT | 3000 trees (120 nodes) | 7M records (Bing HRS data) | 3 hrs on 8 machines (4x speed-up) |
| LSTM | 20M parameters (4 hidden layers) | 1570 hrs of speech data (Windows Phone data) | 1 day on 16 GPUs (15.9x speed-up) |
| CNN | 41M parameters (GoogLeNet) | 2M images (ImageNet 1K dataset) | 30 hrs on 16 GPUs (10x speed-up) |
| Online FTRL | 800M parameters (logistic regression) | 6.4B impressions (Bing Ads click log) | 2400 s on 24 machines (12x speed-up) |
9 Forward Looking: Microsoft Cognitive Toolkit (CNTK)
10 How to Advance Large-Scale Machine Learning
System innovation:
- Leverage the full power of distributed systems and pursue near-linear scale-out/speed-up.
- New distributed training paradigms need to be invented to resolve the bottlenecks of existing distributed machine learning systems.
Algorithmic innovation:
- Machine learning algorithms themselves need sufficiently high efficiency and throughput.
- Existing designs/implementations of machine learning algorithms may not have considered this requirement; redesign/re-implementation may be needed.
11 Evolution of Distributed ML (chart): along the synchronization axis, systems move from synchronous Iterative MapReduce (LDA, LR) through the Parameter Server (deep learning, LDA, GBDT, LR) to asynchronous Dataflow (deep learning); along the parallelism axis, from data parallelism through model parallelism to irregular parallelism.
12 Evolution of Distributed Machine Learning: Iterative MapReduce
- Local computation, synchronous updates
- Uses MapReduce / AllReduce to synchronize parameters among workers
- Supports only synchronous updates
- Example: Spark and other derived systems
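To make the synchronous pattern concrete, here is a minimal sketch of one data-parallel SGD step using MPI AllReduce. This is illustrative code, not DMTK's; the least-squares gradient is just a stand-in for any model's gradient.

```python
import numpy as np
from mpi4py import MPI  # assumes an MPI environment

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()

def local_gradient(w, batch):
    X, y = batch
    return X.T @ (X @ w - y) / len(y)  # illustrative least-squares gradient

def sync_sgd_step(w, batch, lr=0.1):
    g_local = local_gradient(w, batch)
    g_global = np.empty_like(g_local)
    comm.Allreduce(g_local, g_global, op=MPI.SUM)  # barrier: all workers wait here
    return w - lr * g_global / n_workers           # apply the averaged gradient
```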
13 Evolution of Distributed Machine Learning: Iterative MapReduce → Parameter Server
Parameter-server (PS) based solutions were proposed to support:
- Asynchronous updates
- Different mechanisms for model aggregation, especially in the asynchronous setting
- Model parallelism
Examples: DistBelief (Google, NIPS 2012); Petuum (Eric Xing, NIPS 2013); Parameter Server (Mu Li, OSDI 2014); Multiverso PS.
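The asynchronous pull/push cycle can be sketched as follows; this is a toy single-process illustration of the pattern, not the Multiverso API.

```python
import numpy as np

class ParameterServer:
    """Toy PS: workers pull a snapshot and push gradients at their own pace."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()      # snapshot may be stale by the time it is used

    def push(self, grad):
        self.w -= self.lr * grad  # applied whenever a worker finishes, no barrier

def worker_step(server, batch, local_gradient):
    w_snapshot = server.pull()    # no synchronization with other workers
    server.push(local_gradient(w_snapshot, batch))
```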
14 Evolution of Distributed Machine Learning: Iterative MapReduce → Parameter Server → Dataflow
Dataflow engines schedule and execute tasks based on (1) data dependencies and (2) resource availability. Dataflow-based solutions were proposed to support:
- Irregular parallelism (e.g. hybrid data- and model-parallelism), particularly in deep learning
- Both high-level abstraction and low-level flexibility in implementation
Examples: TensorFlow (Google); Dryad (Microsoft, EuroSys 2007); Spark (AMPLab, NSDI 2012).
15 Delay-Compensated ASGD: our work on system innovation.
16 Delayed Gradients

Sequential SGD: $w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_{t+\tau})$

Async SGD: $w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_t)$

Taylor expansion of the delayed gradient:
$g(w_{t+\tau}) = g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) + O(\lVert w_{t+\tau} - w_t \rVert^2)$,
where $\nabla g(w_t)$ corresponds to the Hessian matrix.
17 Unbiased, Efficient Approximation of the Hessian Matrix

Theorem: Assume $Y$ is a discrete random variable with $P(Y = k \mid X = x, w) = \sigma_k(x; w)$, where $0 < \sigma_k(x; w) < 1$ for all $x$, $w$, and $k = 1, \dots, K$. Let $L(x, y, w) = -\sum_k I_{y=k} \log \sigma_k(x; w)$. Then there exists a function $\phi$ such that

$\mathbb{E}_{Y \mid x, w}\!\left[ \frac{\partial^2}{\partial w^2} L(X, Y; w) \right] = \mathbb{E}_{Y \mid x, w}\!\left[ \phi\!\left( \frac{\partial}{\partial w} L(X, Y; w) \right) \right].$

For the cross-entropy loss, second-order derivatives can thus be derived from first-order derivatives in an unbiased manner.
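The identity behind the theorem is the standard information-matrix equality; sketching it in the slide's notation shows why an element-wise choice such as $\phi(g) = g \odot g$ is natural (an assumption consistent with the DC-ASGD line of work, not stated explicitly on this slide):

```latex
% For a correctly specified likelihood, the expected Hessian of the negative
% log-likelihood equals the Fisher information, which equals the expected
% outer product of the score (the gradient):
\mathbb{E}_{Y \mid x, w}\!\left[ \nabla_w^2 L \right]
  = \mathbb{E}_{Y \mid x, w}\!\left[ \nabla_w L \, (\nabla_w L)^{\top} \right]
% Taking the diagonal: the element-wise square of the gradient is an unbiased
% estimator of the Hessian diagonal, i.e. \phi(g) = g \odot g.
```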
18 Delay-Compensated ASGD (DC-ASGD)

ASGD: $w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_t)$

DC-ASGD: $w_{t+\tau+1} = w_{t+\tau} - \eta \left( g(w_t) + \lambda\, \phi(g(w_t)) \odot (w_{t+\tau} - w_t) \right)$
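A minimal sketch of the compensated update, assuming the element-wise choice $\phi(g) = g \odot g$ from above; the hyper-parameter value is arbitrary and this is not the Multiverso implementation.

```python
import numpy as np

def dc_asgd_update(w_now, w_backup, grad, lr=0.1, lam=0.04):
    """One DC-ASGD server update for a delayed gradient.

    w_now    -- current global model w_{t+tau}
    w_backup -- snapshot w_t against which the worker computed grad
    grad     -- delayed gradient g(w_t)
    lam      -- compensation strength lambda (assumed hyper-parameter)
    """
    compensation = lam * grad * grad * (w_now - w_backup)  # phi(g) = g * g
    return w_now - lr * (grad + compensation)
```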
19 Experimental Results (based on ResNet): convergence panels on CIFAR and ImageNet.
20 2C-RNN: A Super-Efficient and Scalable Deep Algorithm for Text Understanding. Our work on algorithm innovation, published at NIPS 2016.
21 Recurrent Neural Networks for Text Applications
A widely used model for sequence representation and learning:
- Language modeling
- Machine translation
- Conversation bots
- Image/video captioning
Major challenges: efficiency and scalability.
22 Language Modeling

$h_t = \sigma(U x_t + W h_{t-1} + b)$
$o_t = V h_t$
$y_t = \mathrm{softmax}(o_t)$

| Symbol | Definition | Dimension |
| --- | --- | --- |
| $x_t$ | (Input) embedding vector of the word at position $t$ | $w$ |
| $U$ | Parameter matrix: input → hidden state | $h \times w$ |
| $W$ | Parameter matrix: hidden state → hidden state | $h \times h$ |
| $V$ | Output embedding matrix: hidden state → output | $|V| \times h$ |
| $y_t$ | Predicted probability for each word | $|V|$ |
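A minimal numpy sketch of one forward step of this model, with toy sizes (the sigmoid stands in for the slide's $\sigma$; real systems would use an LSTM cell and batching):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, w_dim, h_dim = 10_000, 1024, 1024   # toy vocabulary for illustration

X = rng.normal(0, 0.01, (V_size, w_dim))    # input word embeddings
U = rng.normal(0, 0.01, (h_dim, w_dim))     # input -> hidden
W = rng.normal(0, 0.01, (h_dim, h_dim))     # hidden -> hidden
V = rng.normal(0, 0.01, (V_size, h_dim))    # hidden -> output: the 40 GB culprit at 10M vocab
b = np.zeros(h_dim)

def step(word_id, h_prev):
    x_t = X[word_id]
    h_t = 1.0 / (1.0 + np.exp(-(U @ x_t + W @ h_prev + b)))  # h_t = sigma(...)
    o_t = V @ h_t                                            # |V| logits: the costly part
    y_t = np.exp(o_t - o_t.max())
    return h_t, y_t / y_t.sum()                              # softmax

h, y = step(42, np.zeros(h_dim))
```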
23 Challenge in Text Applications: Model Size

Large-scale data:
| Dataset | #tokens | Vocab |
| --- | --- | --- |
| ClueWeb09 (en) | 143,820,387,816 | 10,784,180 |

The resulting model is too large for current GPUs:
| Symbol | Definition | Dimension | Memory size |
| --- | --- | --- | --- |
| $X$ | (Input) embedding vectors of the words | $|V| \times w$ | 10M × 1024 × 4B = 40 GB |
| $U$ | Parameter matrix: input → hidden state | $h \times w$ | 1024 × 1024 × 4B = 4 MB |
| $W$ | Parameter matrix: hidden state → hidden state | $h \times h$ | 1024 × 1024 × 4B = 4 MB |
| $V$ | Output embedding matrix: hidden state → output | $|V| \times h$ | 10M × 1024 × 4B = 40 GB |
| $y_t$ | Predicted probability for each word | $|V|$ | 10M × 4B = 40 MB |
24 Challenge in Text Applications: Running Time

Same large-scale data: ClueWeb09 (en), 143,820,387,816 tokens, 10,784,180-word vocabulary.

Huge computational complexity: to choose one word, we must score every word in the vocabulary.

| Quantity | #operations per token | Unit |
| --- | --- | --- |
| $h_t$ | 2 million | float operations |
| $o_t$ | 10 billion | float operations |

where $h_t = \sigma(U x_t + W h_{t-1} + b)$, $o_t = V h_t$, $y_t = \mathrm{softmax}(o_t)$.
25 Training-Time Estimation on Mainstream Hardware

Dataset: ClueWeb09 (en), 143,820,387,816 tokens, 10,784,180-word vocabulary. The estimate is #tokens × #operations per token × #epochs × 2 (forward and backward propagation) / device FLOPS.

| Device | Computation (#cores, FLOPS) | Global memory cap./BW | Estimated running time |
| --- | --- | --- | --- |
| Xeon Broadwell (14 nm) | 2 × 20 cores, ~0.736 TFLOPS | 8 × 32 GB DDR4 (256 GB) / 95 GBps | 0.143T × 10G × 10 × 2 / 0.736T / 3600 / 24 / 365 ≈ 1232 years |
| GPU K40 (28 nm) | 5.0 TFLOPS (float32), 2880 cores | 12 GB GDDR5 / 288 GBps | 0.143T × 10G × 10 × 2 / 5T / 3600 / 24 / 365 ≈ 181 years |
| GPU M40 (28 nm) | 6.8 TFLOPS (float32), 3072 cores | 24 GB GDDR5 / 288 GBps | 0.143T × 10G × 10 × 2 / 6.8T / 3600 / 24 / 365 ≈ 133 years |
| GPU P100 (16 nm) | 10.6 TFLOPS (float32), 21.2 TFLOPS (float16) | 16 GB HBM2 / 720 GBps | 0.143T × 10G × 10 × 2 / 10.6T / 3600 / 24 / 365 ≈ 85 years |
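The back-of-the-envelope arithmetic behind the table, as a short script (device peak FLOPS as listed above):

```python
TOKENS = 0.143e12       # ~143.8B tokens, ClueWeb09 (en)
OPS_PER_TOKEN = 10e9    # dominated by o_t = V h_t
EPOCHS = 10
FWD_BWD = 2             # forward and backward propagation

def years(peak_flops):
    seconds = TOKENS * OPS_PER_TOKEN * EPOCHS * FWD_BWD / peak_flops
    return seconds / 3600 / 24 / 365

for name, flops in [("Broadwell", 0.736e12), ("K40", 5.0e12),
                    ("M40", 6.8e12), ("P100 fp32", 10.6e12)]:
    print(f"{name}: {years(flops):.0f} years")
# -> about 1232, 181, 133, 85 years, matching the table
```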
26 A Big Challenge for Algorithm Innovation and Hardware Manufacturing

The key problem is the huge vocabulary: the input embedding table $X$ and the output embedding matrix $V$ each scale as $|V| \times h$, i.e. 10M × 1024 × 4B = 40 GB apiece (same memory breakdown as slide 23).
27 Our Proposal: 2-Component Shared Embedding (accepted by NIPS 2016)

Current practice: each word has its own embedding vector, so the vocabulary needs $|V|$ vectors (January, February, one, two each get one).

Our approach: arrange the vocabulary in a 2D table, so each word is represented by two vectors $(x, y)$, e.g. January → $(x_1, y_1)$, February → $(x_1, y_2)$, one → $(x_2, y_1)$, two → $(x_2, y_2)$.
- 2C: each word is partitioned and represented by two vectors $(x, y)$.
- Shared embedding: $x$ is shared by all words in the same row; $y$ is shared by all words in the same column.
- Only $2\sqrt{|V|}$ vectors are needed instead of $|V|$.
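A minimal sketch of the shared-embedding lookup, assuming words have already been allocated to cells of the table (the modular toy allocation here is an assumption; the real allocation is the topic of slide 31):

```python
import numpy as np

vocab = 10_000                        # toy vocabulary size
side = int(np.ceil(np.sqrt(vocab)))  # table is side x side -> 2*side vectors total
dim = 1024

row_emb = np.random.randn(side, dim) * 0.01  # x vectors, shared across a row
col_emb = np.random.randn(side, dim) * 0.01  # y vectors, shared across a column

def cell(word_id):
    """Toy allocation: word id -> (row, column) position in the table."""
    return word_id // side, word_id % side

def embed(word_id):
    r, c = cell(word_id)
    return row_emb[r], col_emb[c]    # the (x, y) pair representing the word

x, y = embed(4242)
```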
28 2C-RNN (unrolled diagram): the network interleaves the row and column components of each word. The previous word's row vector $x^r$ and column vector $x^c$ feed the recurrent state in turn (through $U$ and $W$); at each position the model outputs a row distribution $P_r(w_t)$ and a column distribution $P_c(w_t)$ (through $Y_r$, $Y_c$), and the word probability is $P(w_t) = P_r(w_t) \cdot P_c(w_t)$, in contrast to a standard RNN that predicts $P(w_t)$ directly.
29 Analysis of Model Size

With the same equations ($h_t = \sigma(U x_t + W h_{t-1} + b)$, $o_t = V h_t$, $y_t = \mathrm{softmax}(o_t)$) applied per component, total model size drops from 80 GB to 70 MB:

| Symbol | Definition | Dimension | Memory size |
| --- | --- | --- | --- |
| $x, y$ | (Input) embedding vectors of the word at position $t$ | $2\sqrt{|V|} \times w$ | 2 × (10M)^{1/2} × 1024 × 4B ≈ 25 MB |
| $U_x, U_y$ | Parameter matrices: input → hidden state | $2\,(h \times w)$ | 2 × 1024 × 1024 × 4B = 8 MB |
| $W_x, W_y$ | Parameter matrices: hidden state → hidden state | $2\,(h \times h)$ | 2 × 1024 × 1024 × 4B = 8 MB |
| $V_x, V_y$ | Output embedding matrices: hidden state → output | $2\sqrt{|V|} \times h$ | 2 × (10M)^{1/2} × 1024 × 4B ≈ 25 MB |
| $y_t$ | Predicted probability for each word | $2\sqrt{|V|}$ | < 1 MB |
30 Analysis of Computational Complexity

Per-token output cost drops from 10G to about 10M float operations:

| Quantity | #operations per token | Unit |
| --- | --- | --- |
| $h_t^x, h_t^y$ | 4M | float operations |
| $o_t^x, o_t^y$ | 2 × √(10M) × 1024 ≈ 6M | float operations |

Training-time estimate on a K40 (28 nm, 5.0 TFLOPS float32): 0.143T × 10M × 10 × 2 / 5T / 3600 / 24 / 365 ≈ 0.18 years, down from 181 years. And we can now easily parallelize with the parameter-server framework: with 20 machines, training takes 2-3 days.
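A sketch of the factorized output layer that realizes the saving: compute $\sqrt{|V|}$ row logits plus $\sqrt{|V|}$ column logits instead of $|V|$ logits. In the full 2C-RNN the row and column predictions use successive hidden states, as in the slide-28 diagram; this sketch collapses them into one state for brevity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_probability(h, V_row, V_col, r, c):
    """P(word) = P_row(r) * P_col(c).

    V_row, V_col -- output matrices of shape (sqrt(|V|), h)
    r, c         -- the word's (row, column) cell in the table
    """
    p_row = softmax(V_row @ h)   # sqrt(|V|) * h operations ...
    p_col = softmax(V_col @ h)   # ... instead of |V| * h
    return p_row[r] * p_col[c]
```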
31 How to Allocate Words into the 2D Table

Cold start:
- Partition rows according to prefix: e.g. react, real, return (re*****) share a row.
- Partition columns according to suffix: e.g. Billion, Million, Trillion (****llion) share a column; sure, prepare (****re) share a column.

Bootstrap:
- Train with the current partition for several iterations.
- Adjust the partition based on training loss.
- Continue training.
32 Experimental Results: Medium-Sized Datasets (2013 ACL Workshop)

Test perplexity on ACLW-Spanish and ACLW-French, and model size, compared across KN-4 [1], MLBL [1], LSTM word-in/word-out, LSTM char-CNN-in/word-out [2], our 2C-RNN [cold start], and our 2C-RNN [bootstrap]. (Numeric values missing from this transcript.)

Time to reach the same PPL as the HSM baseline (1 GPU): runtime in hours for HSM vs. 2C-RNN, with the 2C-RNN time split into reallocation and training percentages.

[1] Non-LSTM RNN baselines. [2] Previous state-of-the-art method using character-CNN input: Character-Aware Neural Language Models.
33 Experimental Results: Large-Scale Dataset (One Billion Word benchmark)

Test perplexity and model size compared across KN-5 [1] (PPL 68, ~2 GB), HSM [2] (GB-scale), Blackout-RNN [3] (GB-scale), our 2C-RNN [cold start] (MB-scale), our 2C-RNN [bootstrap] (MB-scale), and the interpolations KN + HSM [2], KN + Blackout-RNN [3], and KN + 2C-RNN. (Most numeric values missing from this transcript.)

Time to reach the same PPL as the HSM baseline (1 GPU): runtime in hours for HSM vs. 2C-RNN, with the 2C-RNN time split into reallocation and training percentages.

[1] One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. [2] Strategies for Training Large Vocabulary Neural Language Models. [3] BlackOut: Speeding Up Recurrent Neural Network Language Models with Very Large Vocabularies.
34 Summary and Forward Looking

DMTK includes innovations in both systems and algorithms:
- Excellent speed-up and widely available system integration
- Advanced distributed optimization methods
- Many world-leading algorithms

Machine learning for distributed deep learning:
- Learning how to acquire, select, and partition the data
- Learning the optimal network structure
- Learning how to perform model updates
- Learning how to tune the hyper-parameters
- Learning how to aggregate local models

Create an AI that can automatically create new AI!
35 DMTK Materials. You are welcome to join our WeChat group, 分布式机器学习联盟 (Distributed Machine Learning Alliance), to discuss big data and AI together. Thanks!
36 (Appendix) Bootstrap: Bipartite Graph Matching
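The reallocation step can be read as an assignment problem: given the training loss each word would incur in each table cell, re-assign words to cells to minimize total loss. A hedged sketch using a generic min-cost bipartite matching solver (scipy's linear_sum_assignment); the paper's exact cost formulation and solver may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def reallocate(loss):
    """loss[i, j]: training loss if word i is placed in (flattened) cell j."""
    words, cells = linear_sum_assignment(loss)  # min-cost bipartite matching
    return dict(zip(words, cells))              # word id -> new cell

loss = np.random.rand(16, 16)                   # toy: 16 words, 4x4 table
new_allocation = reallocate(loss)
```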
37 Comparison (class-based softmax diagram: the vocabulary is grouped into classes $c_1, \dots, c_k$, each containing words $w_{i,1}, \dots, w_{i,k}$)

| Method | Model size | Training time | Test time | Generation time |
| --- | --- | --- | --- | --- |
| Standard softmax | O(|V|·w) | O(|V|·w) | O(|V|·w) | O(|V|·w) |
| Tree-based softmax | O(|V|·w) | O(log|V|·w) | O(log|V|·w) | O(|V|·w) |
| Class-based softmax | O(|V|·w) | O(√|V|·w) | O(√|V|·w) | O(|V|·w) |
| Our 2C | O(√|V|·w) | O(√|V|·w) | O(√|V|·w) | O(√|V|·w) |