T. HOEFLER
Transcription
1 T. Hoefler: Demystifying Parallel and Distributed Deep Learning - An In-Depth Concurrency Analysis. Keynote at the 6th Accelerated Data Analytics and Computing Workshop (ADAC'18)
2 What is Deep Learning good for? Digit recognition, object classification, segmentation, image captioning, gameplay AI, translation, neural computers. A very promising area of research: ~23 papers per day! [Chart: number of deep learning papers per year since 1989]
3 How does Deep Learning work? Deep Learning is Supercomputing! A network computes f(x), mapping an input image to classes (Cat, Dog, Airplane, Horse, Bicycle, Truck), and is trained by layer-wise weight updates. [Charts: accuracy vs. operations of popular models (Canziani et al. 2017); number of users] Datasets: ImageNet (1k): 180 GB, ImageNet (22k): a few TB, industry: much larger. Models: 100-200 layers deep, ~100M-2B parameters, 0.1-8 GiB parameter storage, 10-22k labels and growing (e.g., face recognition), weeks to train.
4 A brief theory of supervised deep learning. Labeled samples x ∈ X are drawn from a distribution D with label domain Y and true label ℓ(x). The network f(x): X → Y has a fixed structure and learned weights w; it is a composition f(x) = f_n(f_{n-1}(... f_2(f_1(x)) ...)) of layers (convolution 1, convolution 2, pooling, convolution 3, fully connected, ...). Training solves
    w* = argmin_{w ∈ R^d} E_{x~D} [ ℓ(w, x) ]
with, for example,
    square loss:        ℓ_sq(w, x) = || f(x) - ℓ(x) ||^2
    0-1 loss:           ℓ_{0/1}(w, x) = 0 if f(x) = ℓ(x), 1 otherwise
    cross-entropy loss: ℓ_ce(w, x) = - Σ_i ℓ(x)_i · log( e^{f(x)_i} / Σ_k e^{f(x)_k} )
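To make the three losses concrete, here is a minimal NumPy sketch (illustrative only; the array names, the 6-class example vectors, and the use of argmax for the 0-1 loss are assumptions, not part of the slides):

    import numpy as np

    def square_loss(fx, lx):
        # l_sq(w, x) = ||f(x) - l(x)||^2
        return np.sum((fx - lx) ** 2)

    def zero_one_loss(fx, lx):
        # l_0/1(w, x) = 0 if the predicted class matches the true class, else 1
        return 0.0 if np.argmax(fx) == np.argmax(lx) else 1.0

    def cross_entropy_loss(fx, lx):
        # l_ce(w, x) = -sum_i l(x)_i * log(softmax(f(x))_i)
        z = np.exp(fx - np.max(fx))          # shift for numerical stability
        softmax = z / np.sum(z)
        return -np.sum(lx * np.log(softmax))

    # one-hot "Cat" label against a 6-class network output
    fx = np.array([2.0, 0.5, 0.1, 0.0, -1.0, 0.3])
    lx = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    print(square_loss(fx, lx), zero_one_loss(fx, lx), cross_entropy_loss(fx, lx))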
5 Stochastic Gradient Descent. SGD approximately solves w* = argmin_{w ∈ R^d} E_{x~D}[ℓ(w, x)] by following stochastic gradients through the layer composition f(x) = f_n(... f_2(f_1(x)) ...) (convolution 1, convolution 2, pooling, convolution 3, fully connected). Per-layer memory footprint:
    layer storage = |w_ℓ| + |f_ℓ(o_{ℓ-1})| + |∇w_ℓ| + |∇o_ℓ|
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
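As a rough illustration of the layer-storage formula, the sketch below counts the bytes for a single layer; the 4096x4096 fully connected example and the 4-byte element size are assumptions:

    def layer_storage(n_weights, n_outputs, bytes_per_elem=4):
        # |w_l| + |f_l(o_{l-1})| + |grad w_l| + |grad o_l|, in bytes,
        # per sample and ignoring the minibatch dimension.
        return bytes_per_elem * (n_weights + n_outputs + n_weights + n_outputs)

    # e.g. a fully connected layer with 4096x4096 weights and 4096 outputs
    print(layer_storage(4096 * 4096, 4096) / 2**20, "MiB")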
6 Trends in deep learning: hardware and multi-node. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: hardware used; shared vs. distributed memory] Deep learning is largely on distributed memory today! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
7 Trends in distributed deep learning: node count and communication. The field is moving fast, trying everything imaginable; survey results from 227 papers in the area of parallel deep learning. [Charts: node count over time; communication mode] Deep learning research is converging to MPI! T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
8 Minibatch Stochastic Gradient Descent (SGD). [Figure: a minibatch of labeled images is pushed through the network, producing per-class scores (Cat, Dog, Airplane, Horse, Bicycle, Truck) that are compared against the true labels to form the weight update.] T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
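A minimal minibatch SGD loop on a toy linear least-squares problem, to illustrate the update pattern the figure describes (the model, learning rate, batch size, and data are made up for illustration, not from the talk):

    import numpy as np

    def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=20, seed=0):
        # Minibatch SGD on the average square loss of a linear model w.
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            order = rng.permutation(len(X))
            for start in range(0, len(X), batch_size):
                idx = order[start:start + batch_size]           # one minibatch
                grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
                w -= lr * grad                                  # weight update
        return w

    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 5))
    w_true = np.arange(1.0, 6.0)
    print(minibatch_sgd(X, X @ w_true))    # approaches w_true = [1, 2, 3, 4, 5]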
9 A primer of relevant parallelism and communication theory. Work-depth model: work W, depth D, average parallelism = W/D (small example: W = 39, D = 7). Parallel reductions for parameter updates, with L = latency, G = per-byte cost, γ bytes per element, m elements, P processes:
    Tree (small vectors):         T = 2L·log2(P) + 2γmG·log2(P)
    Butterfly (small vectors):    T = L·log2(P) + γmG·log2(P)
    Pipeline (large vectors):     T = 2L·(P-1) + 2γmG·(P-1)/P
    RedScat+Gat (large vectors):  T = 2L·log2(P) + 2γmG·(P-1)/P
    Lower bound:                  T ≥ L·log2(P) + 2γmG·(P-1)/P
E. Chan et al.: Collective communication: theory, practice, and experience, CCPE'07; T. Hoefler, D. Moor: Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations, JSFI'14
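These cost models are easy to compare numerically. In the sketch below the machine parameters (latency, per-byte time, element size) and the vector length are placeholder values, not measurements:

    from math import log2

    def allreduce_costs(P, m, L=1e-6, G=1e-9, gamma=1.0):
        # L: latency (s), G: per-byte time (s/byte), m elements of gamma bytes.
        tree      = 2 * L * log2(P) + 2 * gamma * m * G * log2(P)
        butterfly =     L * log2(P) +     gamma * m * G * log2(P)
        pipeline  = 2 * L * (P - 1) + 2 * gamma * m * G * (P - 1) / P
        redscat   = 2 * L * log2(P) + 2 * gamma * m * G * (P - 1) / P
        bound     =     L * log2(P) + 2 * gamma * m * G * (P - 1) / P
        return {"tree": tree, "butterfly": butterfly, "pipeline": pipeline,
                "redscat+gat": redscat, "lower bound": bound}

    # large-vector regime: reduce-scatter + gather comes closest to the bound
    print(allreduce_costs(P=1024, m=10_000_000))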
10 GoogLeNet in more detail: ~6.8M parameters, 22 layers deep. C. Szegedy et al.: Going Deeper with Convolutions, CVPR'15
11 Parallelism in the different layer types. Each layer computes an output o_ℓ = f_ℓ(x, w) (convolutions, pooling, fully connected); within a layer the work W is linear in its size while the depth D is logarithmic, so the average parallelism W/D is large. T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
12 Computing fully connected layers. A fully connected layer computes o_j = σ(Σ_i w_{i,j}·x_i + b_j) for each output j. For a minibatch x_1 ... x_N this becomes a single matrix-matrix multiplication: the N×4 input matrix (each row the inputs x_{n,1}, x_{n,2}, x_{n,3} plus a constant 1) times the 4×2 weight matrix of the w_{i,j} with the bias row (b_1, b_2) appended.
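A small NumPy sketch of this "append a ones column" trick; choosing tanh as the activation σ and the shown shapes are assumptions for illustration:

    import numpy as np

    def fully_connected(X, W, b, act=np.tanh):
        # Batched FC layer: append a constant-1 column to X and a bias row to W,
        # so the whole layer is one matrix-matrix multiplication (a GEMM).
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # N x (d_in + 1)
        Wb = np.vstack([W, b])                            # (d_in + 1) x d_out
        return act(Xb @ Wb)

    X = np.random.randn(4, 3)          # minibatch of N=4 samples, 3 inputs
    W = np.random.randn(3, 2)          # 3 inputs -> 2 outputs
    b = np.zeros((1, 2))
    print(fully_connected(X, W, b).shape)   # (4, 2)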
13 Computing convolutional layers. Main approaches: direct convolution; indirect via im2col followed by a matrix multiplication; FFT-based (transform with F, multiply, transform back with F^{-1}); and Winograd. S. Chetlur et al.: cuDNN: Efficient Primitives for Deep Learning, arXiv 2014; X. Liu et al.: Efficient Sparse-Winograd Convolutional Neural Networks, ICLR'17 Workshop; K. Chellapilla et al.: High Performance Convolutional Neural Networks for Document Processing, Int'l Workshop on Frontiers in Handwriting Recognition 2006; M. Mathieu et al.: Fast Training of Convolutional Networks through FFTs, ICLR'14; A. Lavin and S. Gray: Fast Algorithms for Convolutional Neural Networks, CVPR'16
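A sketch of the indirect (im2col) approach for a single-channel input with stride 1 and no padding; the data layout and the box-filter example are simplifications of the idea, not cuDNN's actual implementation:

    import numpy as np

    def im2col_conv2d(x, w):
        # Indirect convolution: gather k x k patches into columns, then one GEMM.
        H, W_ = x.shape
        k = w.shape[0]
        out_h, out_w = H - k + 1, W_ - k + 1
        cols = np.empty((k * k, out_h * out_w))
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
        return (w.ravel() @ cols).reshape(out_h, out_w)

    x = np.arange(25, dtype=float).reshape(5, 5)
    w = np.ones((3, 3)) / 9.0                      # 3x3 box filter
    print(im2col_conv2d(x, w))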
14 Microbatching (µ-cuDNN). In cuDNN there are ~16 convolution implementations, and performance depends on temporary memory (workspace) size. Key idea: segment the minibatch into microbatches, reuse the workspace, and use different algorithms per microbatch. How to choose microbatch sizes and algorithms? Dynamic programming (space reuse) or integer linear programming (space sharing). The result is fast (up to 4.54x on DeepBench), memory efficient, and utilizes heterogeneous clusters (P100-SXM2). Oyama et al.: µ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching, arXiv 2018
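The dynamic-programming idea can be sketched as follows: segment a minibatch into microbatches whose workspace each fits the (reused) limit, choosing a possibly different algorithm per microbatch. The cost and workspace models below are hypothetical stand-ins for cuDNN measurements, not taken from µ-cuDNN:

    from functools import lru_cache

    def best_split(batch, algs, workspace_limit):
        # algs maps an algorithm name to (time_fn(micro), workspace_fn(micro)).
        @lru_cache(maxsize=None)
        def dp(n):
            # Best (time, plan) for processing n remaining samples.
            if n == 0:
                return 0.0, ()
            best_t, best_plan = float("inf"), ()
            for micro in range(1, n + 1):
                for name, (time_fn, ws_fn) in algs.items():
                    if ws_fn(micro) <= workspace_limit:
                        t_rest, plan = dp(n - micro)
                        t = time_fn(micro) + t_rest
                        if t < best_t:
                            best_t, best_plan = t, ((micro, name),) + plan
            return best_t, best_plan
        return dp(batch)

    # Hypothetical per-microbatch cost/workspace models for two algorithms.
    algs = {"winograd": (lambda m: 1.0 + 0.5 * m, lambda m: 40 * m),
            "gemm":     (lambda m: 2.0 + 0.8 * m, lambda m: 10 * m)}
    print(best_split(batch=64, algs=algs, workspace_limit=2000))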
15 Model parallelism. Parameters can be distributed across processors; the minibatch has to be copied to all processors; backpropagation requires all-to-all communication in every layer. U.A. Müller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int'l Conf. on Neural Networks
16 Pipeline parallelism. Layers/parameters can be distributed across processors; sparse communication pattern (only between pipeline stages); the minibatch has to be copied through all processors. G. Blelloch and C.R. Rosenberg: Network Learning on the Connection Machine, IJCAI'87
17 Data parallelism. A simple and efficient solution, easy to implement; parameters are duplicated at all processors. X. Zhang et al.: An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2, NIPS'89
18 Hybrid parallelism combines data parallelism, model parallelism, and layer (pipeline) parallelism. Layers/parameters can be distributed across processors and the minibatch can be distributed as well; the choice is often specific to the layer type (e.g., distribute fully connected layers but handle convolutional layers data-parallel). This enables arbitrary combinations of data, model, and pipeline parallelism - very powerful! A. Krizhevsky: One weird trick for parallelizing convolutional neural networks, arXiv 2014; J. Dean et al.: Large scale distributed deep networks, NIPS'12; T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, arXiv, Feb. 2018
19 Updating parameters in distributed data parallelism. Centralized: training agents push gradients to a (sharded) parameter server, which applies w ← u(w, ∇w); with s shards, roughly T = 2L + 2PγmG/s. Decentralized: a collective allreduce of ∇w among all agents, T = 2L·log2(P) + 2γmG·(P-1)/P. Design space: collective operations, topologies, neighborhood collectives, RMA? Hierarchical parameter servers (S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM'16) and adaptive minibatch sizes (S.L. Smith et al.: Don't Decay the Learning Rate, Increase the Batch Size, arXiv 2017) refine both schemes.
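A sketch of the decentralized variant using mpi4py: each rank computes a local gradient and the ranks average it with one allreduce per iteration. It assumes mpi4py and an MPI launcher are available; the model size and learning rate are placeholders:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P = comm.Get_size()

    def sgd_step(w, local_grad, lr=0.01):
        # Sum the per-rank gradients over all P agents, then apply the update.
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
        return w - lr * global_grad / P

    # toy usage: every rank holds the same w and its own minibatch gradient
    w = np.zeros(10)
    local_grad = np.random.randn(10)
    w = sgd_step(w, local_grad)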
20 Parameter (and model) consistency - centralized. The parameter exchange frequency can be controlled while still attaining convergence. A (sharded) parameter server applies w ← u(w, ∇w); agents can synchronize every step (synchronous), stay within a maximum staleness bound (stale-synchronous / bounded asynchronous), or not synchronize at all (asynchronous). [Figure: per-agent update timelines for the three consistency models.] Started with Hogwild! [Niu et al. 2011] on shared memory, by chance; DistBelief [Dean et al. 2012] moved the idea to distributed memory. Trades off statistical performance for hardware performance. J. Dean et al.: Large scale distributed deep networks, NIPS'12; F. Niu et al.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent, NIPS'11
21 Parameter (and model) consistency - decentralized. Again, the parameter exchange frequency can be controlled while still attaining convergence, now with a collective allreduce of ∇w among the training agents: every step (synchronous), within a maximum staleness (stale-synchronous / bounded asynchronous), or asynchronously among merging subgroups of agents. [Figure: per-agent allreduce timelines for the three consistency models.] One may also consider limited/slower distribution - gossip [Jin et al. 2016]. Peter H. Jin et al.: How to scale distributed deep learning?, NIPS MLSystems Workshop 2016
22 Parameter consistency in deep learning: the spectrum ranges from consistent to inconsistent - synchronous SGD, stale-synchronous SGD, asynchronous SGD (HOGWILD!), model averaging (e.g., elastic), ensemble learning. Elastic averaging uses physical forces between different versions of w:
    w_{t+1,i} = w_{t,i} - η·∇w_{t,i} - α·(w_{t,i} - w̄_t)
    w̄_{t+1} = (1 - β)·w̄_t + (β/m)·Σ_{i=1}^{m} w_{t,i}
S. Zhang et al.: Deep learning with Elastic Averaging SGD, NIPS'15
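The elastic-averaging update can be written out directly; the hyper-parameter values and the toy gradients below are placeholders, not taken from the paper:

    import numpy as np

    def easgd_step(w_workers, w_center, grads, eta=0.01, alpha=0.001, beta=0.9):
        # Each worker is pulled toward the center variable by the elastic force,
        # and the center moves toward the average of the workers (formulas above).
        new_workers = [w - eta * g - alpha * (w - w_center)
                       for w, g in zip(w_workers, grads)]
        new_center = (1.0 - beta) * w_center + beta * np.mean(w_workers, axis=0)
        return new_workers, new_center

    workers = [np.zeros(3) for _ in range(4)]           # m = 4 training agents
    center = np.zeros(3)
    grads = [np.ones(3) * (i + 1) for i in range(4)]    # placeholder gradients
    workers, center = easgd_step(workers, center, grads)
    print(center)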
23 Parameter consistency in deep learning (continued): at the inconsistent end of the spectrum, ensemble learning trains several models independently and averages their predictions (e.g., over the classes Cat, Dog, Airplane, Horse, Bicycle, Truck). Spectrum: synchronous SGD, stale-synchronous SGD, asynchronous SGD (HOGWILD!), model averaging (e.g., elastic), ensemble learning - from consistent to inconsistent. T.G. Dietterich: Ensemble Methods in Machine Learning, MCS 2000
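For the ensemble end of the spectrum, inference simply averages the outputs of the independently trained models; the class probabilities below are made-up numbers for illustration:

    import numpy as np

    # Average the class-probability outputs of three independently trained models.
    predictions = [np.array([0.60, 0.20, 0.10, 0.05, 0.03, 0.02]),
                   np.array([0.50, 0.30, 0.10, 0.05, 0.03, 0.02]),
                   np.array([0.70, 0.10, 0.10, 0.05, 0.03, 0.02])]
    print(np.mean(predictions, axis=0))   # ensemble prediction over the six classes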
24 Communication optimizations. Different options for how to optimize the updates: send ∇w, receive w; send the FC factors (o_{l-1}, ∇o_l) and compute ∇w on the parameter server; broadcast the factors to avoid receiving the full w; use lossy compression when sending and accumulate the error locally! Quantization: quantize the weight updates and potentially the weights; the main trick is stochastic rounding [1] - the expectation is more accurate - which enables low precision (half, quarter) to become standard; TernGrad - ternary weights [2], 1-bit SGD [3]. Sparsification: do not send small weight updates, or only send the top-k [4], and accumulate the rest locally. (image source: ai.intel.com) [1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML'15 [2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016 [3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014 [4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
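Two of these techniques in miniature: stochastic rounding to a coarse grid, and top-k sparsification with local error accumulation. The grid step, vector size, and k are arbitrary choices for illustration:

    import numpy as np
    rng = np.random.default_rng(0)

    def stochastic_round(x, step=1/256):
        # Round to a coarse grid so that E[round(x)] = x (the "main trick" above).
        scaled = x / step
        low = np.floor(scaled)
        return step * (low + (rng.random(x.shape) < (scaled - low)))

    def topk_sparsify(grad, residual, k):
        # Send only the k largest-magnitude entries; accumulate the rest locally.
        acc = grad + residual
        idx = np.argpartition(np.abs(acc), -k)[-k:]
        sparse = np.zeros_like(acc)
        sparse[idx] = acc[idx]
        return sparse, acc - sparse          # (what is sent, new local residual)

    g = rng.standard_normal(8)
    residual = np.zeros(8)
    sent, residual = topk_sparsify(stochastic_round(g), residual, k=2)
    print(sent, residual)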
25 SparCML: quantized sparse allreduce for decentralized updates. [Figure: sparse update vectors ∇w_1, ∇w_2, ∇w_3 are combined into the dense update ∇w; chart of MNIST test accuracy.] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
26 Hyperparameter and architecture search: meta-optimization of hyper-parameters (e.g., momentum) and of the DNN architecture, using reinforcement learning [1] (explore/exploit different configurations), genetic algorithms with modified (specialized) mutations [2], particle swarm optimization [3], and other meta-heuristics such as evolutionary algorithms [4]. [1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017 [2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018 [3] P.R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO'17 [4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR'18
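A toy sketch of such meta-optimization as a random search over learning rate and momentum; the "validation loss" below is a synthetic stand-in for actually training and evaluating a network, and the search ranges are arbitrary:

    import numpy as np
    rng = np.random.default_rng(0)

    def validation_loss(lr, momentum):
        # Stand-in for "train the network and measure validation loss";
        # a real search would launch many training runs in parallel.
        return (np.log10(lr) + 2.5) ** 2 + (momentum - 0.9) ** 2 + 0.01 * rng.random()

    trials = [(validation_loss(lr, mom), lr, mom)
              for lr, mom in zip(10 ** rng.uniform(-5, -1, 50),
                                 rng.uniform(0.0, 1.0, 50))]
    print(min(trials))   # best (loss, learning rate, momentum) found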
27 Application: Neural Code Comprehension. In 2017, GitHub reports 1 billion git commits in 337 languages! Can DNNs understand code? Previous approaches read the source code directly, which is suboptimal (loops, functions). Instead, code in C/C++, Python, CUDA, FORTRAN, Java, OpenCL, ... is lowered to LLVM IR, e.g.
    double thres = 5;
    if (x < thres) x = y * y;
    else x = 2 * y;
    x += 1;
becomes
    %cmp = fcmp olt double %x, 5
    br i1 %cmp, label %LT, label %GE
    LT: %2 = fmul double %y, %y
    GE: %3 = fmul double 2, %y
    AFTER: %4 = phi double [%2,%LT], [%3,%GE]
    %5 = fadd double %4, 1
[Figure: dataflow over basic blocks and the resulting conteXtual Flow Graph for this snippet.] Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
28 Application: Neural Code Comprehension - embedding space. Each LLVM IR statement (e.g., %cmp = fcmp olt double %x, 5 or %3 = fmul double 2, %y) is a node in the conteXtual Flow Graph and is embedded using the Skip-gram model, giving an embedding matrix of size (vocabulary size, in #statements) x (embedding dimensions). Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
29 Application: Neural Code Comprehension - downstream tasks. LSTM units on top of the statement embeddings predict, for example, which device is faster (CPU or GPU) and the optimal tiling; further uses include malicious code detection, guided programming, code optimization, and optimal hardware mapping. Ben-Nun et al.: Neural Code Comprehension: A Learnable Representation of Code Semantics, arXiv 2018
30 Outlook. Full details are in the survey (60 pages): parallelism, distribution, synchronization, plus additional content on unsupervised (GAN/autoencoders) and recurrent (RNN/LSTM) models. A call to action to the HPC and ML/DL communities to join forces! It's already happening at the tool level - we need more joint events!
T. Ben-Nun, T. Hoefler: Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, https://www.arxiv.org/abs/1802.09941
More information