Deep Learning Tutorial. 李宏毅 Hung-yi Lee
|
|
- Hilary Bryant
- 5 years ago
- Views:
Transcription
1 Deep Learning Tutorial 李宏毅 Hung-yi Lee
2 Outline Part I: Introduction of Deep Learning Part II: Why Deep? Part III: Tips for Training Deep Neural Network Part IV: Neural Network with Memory
3 Part I: Introduction of Deep Learning What people already knew in 1980s
4 Neural Network Input x 1 x 2 Layer 1 Layer 2 Layer L Output y 1 y 2 x N y M Input Layer Hidden Layers Output Layer Deep means many hidden layers
5 Single Neuron z a 1 w 1 f: R K R a 2 w 2 a K w K b a 1 w 1 a 2 w 2 w K z z y a K weights b bias Activation function
6 Example of Neural Network x y 1 x y Sigmoid Function z 1 1 e z z z
7 Example of Neural Network x y 1 x y f: R 2 R 2 f 1 1 =
8 Example of Neural Network x y 1 x y f: R 2 R 2 f 1 1 = f 0 0 = Different parameters define different function f
9 Matrix Operation x y 1 x a W x b = σ Using parallel computing techniques to speed up matrix operation y
10 Fully Connected Feedforward Neural Network x 1 x 2 y 1 y 2 x N W x b a W, b W 2, b x 1 a W a b a W L, 2 a b L L a y M y L L-1 L L W a b a y
11 Handwriting Digit Recognition Machine 2 Hello World for Deep Learning
12 Handwriting Digit Recognition x 1 y 1 Input is 1 x 2 y 2 Input is 2 16 x 16 = 256 Ink 1 No ink 1 x 256 Input object should always be represented as a fixed-length vector. y 10 Input is 0 Each output represents the confidence of a digit.
13 Handwriting Digit Recognition x 1 y 1 Input is 1 x 2 y 2 Input is 2 16 x 16 = 256 Ink 1 No ink 1 x 256 y 10 Input is 0 Hopefully, the function network can Input: y 1 has the maximum value Input: y 2 has the maximum value
14 Handwriting Digit Recognition x 1 y Input is 1 x 2 y Input is 2 16 x 16 = 256 Ink 1 No ink 1 x 256 y 10 In general, the output of network does not sum to one May not be easy to interpret 0.99 Input is 0
15 Softmax Softmax layer as the output layer Ordinary Output layer z 1 y 1 z 1 z 2 y 2 z 2 z 3 y 3 z 3
16 Softmax Softmax layer as the output layer Softmax Layer Probability: 1 > y i > 0 i y i = 1 z 1 z 2 z 3 3 e 1 e e e z 1 20 e z 2 e z 3 3 j z j e 0.88 y 0.12 y 0 y e e e z z z j 1 3 j 1 3 j 1 e e e z z z j j j
17 Handwriting Digit Recognition x 1 y Input is 1 16 x 16 = 256 Ink 1 No ink 1 x 2 x 256 Hopefully, the function network can Input: Input: Softmax How to let the neural network achieve this y y Input is 2 Input is 0 y 1 has the maximum value y 2 has the maximum value
18 Training Data Preparing training data: images and their labels Find the network parameters from the training data
19 Cost a training data 1 x 1 y x 2 x 256 Softmax y y Cost C 1 0 C: the difference between the output of network and its target C can be Euclidean distance, cross entropy, etc.
20 Cost For all training data x 1 x 2 x R NN NN NN y 1 y 1 C 1 y 2 y 2 C 2 x 3 NN y 3 y 3 C 3 y R y R C R Total Cost: C = R r=1 C r Find the network parameters that can minimize this value
21 Gradient Descent Assume there are only two parameters w 1 and w 2 in a network. The colors represent the value of C. Randomly pick a starting point θ 0 w 2 Compute the negative gradient at θ 0 C θ 0 θ 0 C θ 0 Times the learning rate η C θ 0 w 1
22 Gradient Descent w 2 η C θ 1 θ 2 η C θ 2 C θ 2 η C θ 0 θ 0 Eventually, we would reach a minima.. θ1 θ1 C C θ 0 Randomly pick a starting point θ 0 Compute the negative gradient at θ 0 C θ 0 Times the learning rate η η C θ 0 w 1
23 Local Minima Gradient descent never guarantee global minima Different initial point θ 0 C Reach different minima, so different results w 1 w 2
24 Besides local minima cost Very slow at the plateau Stuck at saddle point Stuck at local minima C θ 0 C θ = 0 parameter space C θ = 0
25 In physical world Momentum How about put this phenomenon in gradient descent?
26 Momentum cost Still not guarantee reaching global minima, but give some hope Real Movement = Negative of Gradient + Momentum Negative of Gradient Momentum Real Movement Gradient = 0
27 Mini-batch Mini-batch Mini-batch Faster Better! x 1 NN y 1 y 1 C 1 x 31 NN y 31 y 31 x 2 NN C 31 y 2 y 2 C 2 x 16 NN y 16 y 16 C 16 Randomly initialize θ 0 Pick the 1 st batch C = C 1 + C 31 + θ 1 θ 0 η C θ 0 Pick the 2 nd batch C = C 2 + C 16 + θ 2 θ 1 η C θ 1 Until all mini-batches have been picked one epoch Repeat above process
28 Backpropagation A network can have millions of parameters. Backpropagation is the way to compute the gradients efficiently (not today) Many toolkits can compute the gradients automatically Just a function
29 Part II: Why Deep?
30 Universality Theorem Any continuous function f f : R N R M Can be realized by a network with one hidden layer (given enough hidden neurons) Reference for the reason: eplearning.com/chap4.html Why Deep neural network not Fat neural network?
31 Deeper is Better? Word error rate (WER) Multiple layers Not surprised, more parameters, better performance Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech
32 Fat + Short v.s. Thin + Tall The same number of parameters Which one is better? x1 x 2 x N Shallow x1 x 2 x N Deep
33 Fat + Short v.s. Thin + Tall Word error rate (WER) Multiple layers 1 hidden layer Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech
34 Why Deep? Consider Logic Circuits A two levels of basic logic gates can represent any Boolean function. However, no one uses two levels of logic gates to build computers Using multiple layers of logic gates to build some functions are much simpler (less gates needed).
35 Why Deep? Modulation Classifier 1 Girls with glasses Image Classifier 2 Boys with glasses Classifier 3 Boys wearing blue shirt
36 Why Deep? Modulation Boy or Girl? simper Classifier 1 Girls with glasses Image With glasses? Classifier 2 Boys with glasses Little Data wearing blue shirt? Classifier 3 Boys wearing blue shirt Multiple Layers
37 Hand-crafted kernel function SVM Deep Learning x 1 Apply simple classifier Source of image: Learnable kernel φ x simple classifier y 1 x x 2 y 2 x N y M
38 Boosting Weak classifier Input x Weak classifier Combine Weak classifier Deep Learning Weak classifier Boosted weak classifier Boosted Boosted weak classifier x 1 x 2 x N
39 Hard to get the power of Deep Before 2006, deeper usually does not imply better.
40 Part III: Tips for Training DNN
41 Recipe for Learning
42 Recipe for Learning Don t forget! Modify the Network Better optimization Strategy Preventing Overfitting overfitting
43 Recipe for Learning Modify the Network New activation function, for example, ReLU or Maxout Better optimization Strategy Adaptive learning rate Prevent Overfitting Dropout Only use this approach when you already obtained good results on the training data.
44 Part III: Tips for Training DNN New Activation Function
45 ReLU Rectified Linear Unit (ReLU) a = 0 σ z a a = z z Reason: 1. Fast to compute 2. Biological reason 3. Infinite sigmoid with different biases [Xavier Glorot, AISTATS 11] 4. Vanishing gradient problem
46 Vanishing Gradient Problem x 1 x 2 y 1 y 2 x N y M Smaller gradients Learn very slow Almost random Larger gradients Learn very fast Already converge based on random!?
47 ReLU 0 x 1 x y 1 y 2 0
48 ReLU A Thinner linear network x 1 y 1 x 2 y 2 Do not have smaller gradients
49 ReLU - variant Leaky ReLU [Andrew L. Maas, ICML 13] a a = z Parametric ReLU [Kaiming He, arxiv 15] a a = z a = 0.01z z a = αz z α also learned by gradient descent
50 Maxout All ReLU variants are just special cases of Maxout Learnable activation function [Ian J. Goodfellow, ICML 13] Input x Max max z 1 1, z Max 2 x Max Max 4 You can have more than 2 elements in a group.
51 Maxout ReLU is special case Input x 1 w b z ReLU a Input x 1 0 w 0 b + z 1 Max a + z 2 max z 1, z 2 a z = wx + b x a z 1 = wx + b x z 2 =0
52 Maxout ReLU is special case Input x 1 w b z ReLU a a Input w x w b 1 z = wx + b x b + z 1 Max a + z 2 max z 1, z 2 Learnable Activation Function a z 1 = wx + b x z 2 = w x + b
53 Part III: Tips for Training DNN Adaptive Learning Rate
54 Learning Rate Set the learning rate η carefully η C θ 0 If learning rate is too large w 2 Cost may not decrease after each update C θ 0 θ 0 w 1
55 Learning Rate Can we give different parameters Set the learning different learning rate η carefully rate? If learning rate is too large w 2 η C θ 0 θ 0 C θ 0 Cost may not decrease after each update If learning rate is too small Training would be too slow w 1
56 Adagrad Divide the learning rate of each parameter w by a parameter dependent value σ w Original Gradient descent w t+1 w t gߟ t Adagrad ߟ t w t+1 w σ w g t Parameter dependent learning rate g t = C θt w σt is parameter dependent σ t = t i=0 g i 2
57 Adagrad w t+1 w t η t i=0 g i 2 g t g 0 g η η = η 0.1 = η 0.22 g 0 g Observation: 1. Learning rate is smaller and smaller for all parameters 2. Smaller gradient, larger learning rate, and vice versa η 20 2 η = η 20 = η 22 Why?
58 Reason of Adagrad Smaller gradient, larger learning rate, and vice versa Smaller Learning Rate w 2 Larger Learning Rate w 1
59 Not the whole story Adagrad RMSprop Adadelta Adam AdaSecant No more pesky learning rates
60 Part III: Tips for Training DNN Dropout
61 Dropout Pick a mini-batch θ t θ t 1 η C θ t 1 Training: Each time before computing the gradients Each neuron has p% to dropout
62 Pick a mini-batch Dropout θ t θ t 1 η C θ t 1 Training: Thinner! Each time before computing the gradients Each neuron has p% to dropout The structure of the network is changed. Using the new network for training For each mini-batch, we resample the dropout neurons
63 Dropout Testing: No dropout If the dropout rate at training is p%, all the weights times (1-p)% Assume that the dropout rate is 50%. If a weight w = 1 by training, set w = 0.5 for testing.
64 Dropout - Intuitive Reason 我的 partner 會擺爛, 所以我要好好做 When teams up, if everyone expect the partner will do the work, nothing will be done finally. However, if you know your partner will dropout, you will do better. When testing, no one dropout actually, so obtaining good results eventually.
65 Dropout - Intuitive Reason Why the weights should multiply (1-p)% (dropout rate) when testing? Training of Dropout Testing of Dropout Assume dropout rate is 50% w 1 w 2 w 3 w 4 z No dropout w 1 w 2 w 3 w 4 Weights from training z 2z z Weights multiply (1-p)% z z
66 Dropout is a kind of ensemble. Ensemble Training Set Set 1 Set 2 Set 3 Set 4 Network 1 Network 2 Network 3 Network 4 Train a bunch of networks with different structures
67 Dropout is a kind of ensemble. Ensemble Testing data x Network 1 Network 2 Network 3 Network 4 y 1 y 2 y 3 y 4 average
68 Dropout is a kind of ensemble. minibatch 1 minibatch 2 minibatch 3 minibatch 4 Training of Dropout M neurons 2 M possible networks Using one mini-batch to train one network Some parameters in the network are shared
69 Dropout is a kind of ensemble. Testing of Dropout testing data x All the weights multiply (1-p)% y 1 y 2 y 3 average y
70 More about dropout Dropout works better with Maxout [Ian J. Goodfellow, ICML 13] More reference for dropout [Nitish Srivastava, JMLR 14] [Pierre Baldi, NIPS 13][Geoffrey E. Hinton, arxiv 12] Dropconnect [Li Wan, ICML 13] Dropout delete neurons Dropconnect deletes the connection between neurons Annealed dropout [S.J. Rennie, SLT 14] Dropout rate decreases by epochs Standout [J. Ba, NISP 13] Each neural has different dropout rate
71 Part IV: Neural Network with Memory
72 Neural Network needs Memory Input and output are sequences E.g. Name Entity Recognition (NER) NOT Detecting named entities like name of people, locations, organization, etc. in a sentence. NOT NOT ORG NOT NOT NOT y 1 y 2 y 3 y 4 y 5 y 6 y 7 DNN DNN DNN DNN DNN DNN DNN x 1 x 2 x 3 x 4 x 5 x 6 x 7 the president of apple eats an apple
73 Neural Network needs Memory Input and output are sequences E.g. Name Entity Recognition (NER) NOT DNN needs memory! NOT NOT ORG NOT NOT NOT y 1 y 2 y 3 y 4 y 5 y 6 y 7 DNN DNN DNN DNN DNN DNN DNN x 1 x 2 x 3 x 4 x 5 x 6 x 7 the president of apple eats an apple
74 Recurrent Neural Network (RNN) y 1 y 2 The values are stored by RNN in the previous step. a1 a2 Memory can be considered as another input. x1 x2
75 RNN y 1 RNN input: x 1 x 2 x 3 x N y 1 =softmax(w o a 1 ) memory 0 a 1 copy W h W o a 1 =σ(w i x 1 +W h 0) W i x 1
76 RNN y 1 y 2 RNN input: x 1 x 2 x 3 x N y 2 =softmax(w o a 2 ) memory 0 a 1 a 2 copy W h W o a 2 =σ(w i x 2 +W h a 1 ) W i x 2
77 RNN y 1 y 2 y 3 RNN input: x 1 x 2 x 3 x N y 3 =softmax(w o a 3 ) memory 0 a 1 a 2 a 3 copy W h W o a 3 =σ(w i x 3 +W h a 2 ) W i x 3
78 RNN y 1 y 2 y 3 init 0 W o copy copy W a o 1 a 2 a 3 a 1 a 2 W i W o W h W h W h W i x 1 x 2 x 3 W i The same network is used again and again. Output y i depends on x 1, x 2, x i
79 RNN y 1 y 2 y 3 init 0 W h W o W h W o W h W o W i W i x 1 x 2 x 3 W i The same network is used again and again. Output y i depends on x 1, x 2, x i
80 Training y 1 y 2 y 3 y 1 y 2 y 3 init 0 W h W o W h W o W h W o W i W i x 1 x 2 x 3 W i Backpropagation through time (BPTT)
81 Of course it can be deep y t y t+1 y t+2 x t x t+1 x t+2
82 Bidirectional RNN x t x t+1 x t+2 y t y t+1 y t+2 x t x t+1 x t+2
83 Many to Many (Output is shorter) Both input and output are both sequences, but the output is shorter. E.g. Speech Recognition Problem? Why can t it be 好棒棒 Output: Input: 好棒 好好好 (character sequence) Trimming 棒棒棒棒棒 (vector sequence)
84 Many to Many (Output is shorter) Both input and output are both sequences, but the output is shorter. Connectionist Temporal Classification (CTC) [Alex Graves, ICML 06][Alex Graves, ICML 14][Haşim Sak, Interspeech 15][Jie Li, Interspeech 15][Andrew Senior, ASRU 15] 好棒 好棒棒 Add an extra symbol φ representing null 好 φ φ 棒 φ φ φ φ 好 φ φ 棒 φ 棒 φ φ
85 Many to Many (No Limitation) Both input and output are both sequences with different lengths. Sequence to sequence learning E.g. Machine Translation (machine learning 機器學習 ) machine learning Containing all information about input sequence
86 Many to Many (No Limitation) Both input and output are both sequences with different lengths. Sequence to sequence learning E.g. Machine Translation (machine learning 機器學習 ) 機 器 學 習 慣 性 machine learning Don t know when to stop
87 Many to Many (No Limitation) 推 tlkagk: ========= 斷 ========== Ref: E6%8E%A8%E6%96%87 ( 鄉民百科 )
88 Many to Many (No Limitation) Both input and output are both sequences with different lengths. Sequence to sequence learning E.g. Machine Translation (machine learning 機器學習 ) 機 器 學 習 === machine learning Add a symbol === ( 斷 ) [Ilya Sutskever, NIPS 14][Dzmitry Bahdanau, arxiv 15]
89 The error surface is rough. The error surface is either very flat or very steep. Clipping Cost w 2 w 1 [Razvan Pascanu, ICML 13]
90 Toy Example w = 1 w = 1.01 y 1000 = 1 y Large gradient Small Learning rate? w = 0.99 w = 0.01 y y small gradient Large Learning rate? y 1 y 2 y 3 y w w w
91 Helpful Techniques NAG: Advance momentum method RMS Prop Advanced Adagrad Second derivative change Long Short-term Memory (LSTM) Can deal with the gradient vanishing problem
92 Long Short-term Memory (LSTM) Signal control the output gate (Other part of the network) Signal control the input gate (Other part of the network) Other part of the network Output Gate Memory Cell Input Gate Other part of the network Forget Gate LSTM Special Neuron: 4 inputs, 1 output Signal control the forget gate (Other part of the network)
93 a = h c f z o z o f z o multiply h c Activation function f is usually a sigmoid function Between 0 and 1 c f z f Mimic open and close gate c c z f cf z f z i f z i g z f z i multiply g z c = g z f z i + cf z f z
94 Original Network: Simply replace the neurons with LSTM a 1 a 2 z 1 z 2 x 1 x 2 Input
95 + a 1 a times of parameters x 1 x 2 Input
96 LSTM Extension: peephole y t y t+1 c t 1 c t c t z f z i z z o z f z i z z o c t h t x t c t+1 h t+1 x t+1
97 Other Simpler Alternatives Gated Recurrent Unit (GRU) Structurally Constrained Recurrent Network (SCRN) [Cho. EMNLP 14] [Tomas Mikolov, ICLR 15] Vanilla RNN Initialized with Identity matrix + ReLU activation function [Quoc V. Le, arxiv 15] Outperform or be comparable with LSTM in four different tasks
98 What is the next wave? Attention-based Model Reading Head Reading Head Controller Writing Head Memory Writing Head Controller Input x DNN output y
99 Concluding Remarks
100 Concluding Remarks Introduction of deep learning Discussing some reasons using deep learning New techniques for deep learning ReLU, Maxout Giving all the parameters different learning rates Dropout Network with memory Recurrent neural network Long short-term memory (LSTM)
101 Thank you for your attention!
Based on the original slides of Hung-yi Lee
Based on the original slides of Hung-yi Lee New Activation Function Rectified Linear Unit (ReLU) σ z a a = z Reason: 1. Fast to compute 2. Biological reason a = 0 [Xavier Glorot, AISTATS 11] [Andrew L.
More informationDeep Learning Tutorial. 李宏毅 Hung-yi Lee
Deep Learning Tutorial 李宏毅 Hung-yi Lee Deep learning attracts lots of attention. Google Trends Deep learning obtains many exciting results. The talks in this afternoon This talk will focus on the technical
More informationBased on the original slides of Hung-yi Lee
Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services
More informationTips for Deep Learning
Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results
More informationTips for Deep Learning
Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results
More informationMore Tips for Training Neural Network. Hung-yi Lee
More Tips for Training Neural Network Hung-yi ee Outline Activation Function Cost Function Data Preprocessing Training Generalization Review: Training Neural Network Neural network: f ; θ : input (vector)
More informationDeep learning attracts lots of attention.
Deep Learning Deep learning attracts lots of attention. I believe you have seen lots of exciting results before. Deep learning trends at Google. Source: SIGMOD/Jeff Dean Ups and downs of Deep Learning
More informationDeep Learning, Data Irregularities and Beyond
Deep Learning, Data Irregularities and Beyond Dr. Swagatam Das Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata 700 108, India. E-mail: swagatam.das@isical.ac.in Road
More informationSlide credit from Hung-Yi Lee & Richard Socher
Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step
More informationDeep Learning. Hung-yi Lee 李宏毅
Deep Learning Hung-yi Lee 李宏毅 Deep learning attracts lots of attention. I believe you have seen lots of exciting results before. Deep learning trends at Google. Source: SIGMOD 206/Jeff Dean 958: Perceptron
More informationRecurrent Neural Networks (Part - 2) Sumit Chopra Facebook
Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recap Standard RNNs Training: Backpropagation Through Time (BPTT) Application to sequence modeling Language modeling Applications: Automatic speech
More informationIMPROVING STOCHASTIC GRADIENT DESCENT
IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu
More informationNeural Networks: Optimization & Regularization
Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg
More informationRecurrent Neural Networks with Flexible Gates using Kernel Activation Functions
2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,
More informationDeep Feedforward Networks. Seung-Hoon Na Chonbuk National University
Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationLong-Short Term Memory and Other Gated RNNs
Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling
More informationDeep Learning. Recurrent Neural Network (RNNs) Ali Ghodsi. October 23, Slides are partially based on Book in preparation, Deep Learning
Recurrent Neural Network (RNNs) University of Waterloo October 23, 2015 Slides are partially based on Book in preparation, by Bengio, Goodfellow, and Aaron Courville, 2015 Sequential data Recurrent neural
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationClassification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box
ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses
More informationIntroduction to Neural Networks
CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character
More informationCSC321 Lecture 15: Exploding and Vanishing Gradients
CSC321 Lecture 15: Exploding and Vanishing Gradients Roger Grosse Roger Grosse CSC321 Lecture 15: Exploding and Vanishing Gradients 1 / 23 Overview Yesterday, we saw how to compute the gradient descent
More informationEve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates
Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationRecurrent Neural Networks. Jian Tang
Recurrent Neural Networks Jian Tang tangjianpku@gmail.com 1 RNN: Recurrent neural networks Neural networks for sequence modeling Summarize a sequence with fix-sized vector through recursively updating
More informationHigh Order LSTM/GRU. Wenjie Luo. January 19, 2016
High Order LSTM/GRU Wenjie Luo January 19, 2016 1 Introduction RNN is a powerful model for sequence data but suffers from gradient vanishing and explosion, thus difficult to be trained to capture long
More informationMachine Learning Lecture 14
Machine Learning Lecture 14 Tricks of the Trade 07.12.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability
More informationNormalization Techniques in Training of Deep Neural Networks
Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross
More informationIntroduction to Deep Neural Networks
Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic
More informationarxiv: v3 [cs.lg] 14 Jan 2018
A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation Gang Chen Department of Computer Science and Engineering, SUNY at Buffalo arxiv:1610.02583v3 [cs.lg] 14 Jan 2018 1 abstract We describe
More informationMachine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016
Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text
More informationIntroduction to Convolutional Neural Networks (CNNs)
Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationCSC321 Lecture 10 Training RNNs
CSC321 Lecture 10 Training RNNs Roger Grosse and Nitish Srivastava February 23, 2015 Roger Grosse and Nitish Srivastava CSC321 Lecture 10 Training RNNs February 23, 2015 1 / 18 Overview Last time, we saw
More informationEE-559 Deep learning LSTM and GRU
EE-559 Deep learning 11.2. LSTM and GRU François Fleuret https://fleuret.org/ee559/ Mon Feb 18 13:33:24 UTC 2019 ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE The Long-Short Term Memory unit (LSTM) by Hochreiter
More informationLecture 17: Neural Networks and Deep Learning
UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions
More informationDeep Learning II: Momentum & Adaptive Step Size
Deep Learning II: Momentum & Adaptive Step Size CS 760: Machine Learning Spring 2018 Mark Craven and David Page www.biostat.wisc.edu/~craven/cs760 1 Goals for the Lecture You should understand the following
More informationTTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization
TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Vanishing and Exploding Gradients ReLUs Xavier Initialization Batch Normalization Highway Architectures: Resnets, LSTMs and GRUs Causes
More informationFaster Training of Very Deep Networks Via p-norm Gates
Faster Training of Very Deep Networks Via p-norm Gates Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh Center for Pattern Recognition and Data Analytics Deakin University, Geelong Australia Email:
More informationNeural Networks. Nicholas Ruozzi University of Texas at Dallas
Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationStatistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks
Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered
More informationRECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS
2018 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17 20, 2018, AALBORG, DENMARK RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS Simone Scardapane,
More informationIntroduction to RNNs!
Introduction to RNNs Arun Mallya Best viewed with Computer Modern fonts installed Outline Why Recurrent Neural Networks (RNNs)? The Vanilla RNN unit The RNN forward pass Backpropagation refresher The RNN
More informationDeep Recurrent Neural Networks
Deep Recurrent Neural Networks Artem Chernodub e-mail: a.chernodub@gmail.com web: http://zzphoto.me ZZ Photo IMMSP NASU 2 / 28 Neuroscience Biological-inspired models Machine Learning p x y = p y x p(x)/p(y)
More informationDeep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation
Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks
More informationOPTIMIZATION METHODS IN DEEP LEARNING
Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)
More informationarxiv: v2 [cs.ne] 7 Apr 2015
A Simple Way to Initialize Recurrent Networks of Rectified Linear Units arxiv:154.941v2 [cs.ne] 7 Apr 215 Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton Google Abstract Learning long term dependencies
More informationMachine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group
Machine Learning for Computer Vision 8. Neural Networks and Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group INTRODUCTION Nonlinear Coordinate Transformation http://cs.stanford.edu/people/karpathy/convnetjs/
More informationOptimization for Training I. First-Order Methods Training algorithm
Optimization for Training I First-Order Methods Training algorithm 2 OPTIMIZATION METHODS Topics: Types of optimization methods. Practical optimization methods breakdown into two categories: 1. First-order
More informationArtificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino
Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as
More informationGated RNN & Sequence Generation. Hung-yi Lee 李宏毅
Gated RNN & Sequence Generation Hung-yi Lee 李宏毅 Outline RNN with Gated Mechanism Sequence Generation Conditional Sequence Generation Tips for Generation RNN with Gated Mechanism Recurrent Neural Network
More informationBackpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1 Neural Networks Outline
More informationDeep Learning for Speech Recognition. Hung-yi Lee
Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New
More informationUnderstanding Neural Networks : Part I
TensorFlow Workshop 2018 Understanding Neural Networks Part I : Artificial Neurons and Network Optimization Nick Winovich Department of Mathematics Purdue University July 2018 Outline 1 Neural Networks
More informationCSC321 Lecture 16: ResNets and Attention
CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the
More informationDeep Learning and Lexical, Syntactic and Semantic Analysis. Wanxiang Che and Yue Zhang
Deep Learning and Lexical, Syntactic and Semantic Analysis Wanxiang Che and Yue Zhang 2016-10 Part 2: Introduction to Deep Learning Part 2.1: Deep Learning Background What is Machine Learning? From Data
More informationRecurrent neural networks
12-1: Recurrent neural networks Prof. J.C. Kao, UCLA Recurrent neural networks Motivation Network unrollwing Backpropagation through time Vanishing and exploding gradients LSTMs GRUs 12-2: Recurrent neural
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline
More informationCS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning
CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network
More informationMachine Learning
Machine Learning 10-315 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 03/29/2019 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter
More informationRecurrent Neural Network
Recurrent Neural Network Xiaogang Wang xgwang@ee..edu.hk March 2, 2017 Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 1 / 48 Outline 1 Recurrent neural networks Recurrent neural networks
More informationSegmental Recurrent Neural Networks for End-to-end Speech Recognition
Segmental Recurrent Neural Networks for End-to-end Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals TTI-Chicago, UoE, CMU and UW 9 September 2016 Background A new wave
More informationSequence Modeling with Neural Networks
Sequence Modeling with Neural Networks Harini Suresh y 0 y 1 y 2 s 0 s 1 s 2... x 0 x 1 x 2 hat is a sequence? This morning I took the dog for a walk. sentence medical signals speech waveform Successes
More informationEE-559 Deep learning Recurrent Neural Networks
EE-559 Deep learning 11.1. Recurrent Neural Networks François Fleuret https://fleuret.org/ee559/ Sun Feb 24 20:33:31 UTC 2019 Inference from sequences François Fleuret EE-559 Deep learning / 11.1. Recurrent
More informationLecture 15: Exploding and Vanishing Gradients
Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train
More informationNeural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann
Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable
More informationDeep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang
Deep Feedforward Networks Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Goal: approximate some function f e.g., a classifier, maps input to a class y = f (x) x y Defines a mapping
More informationNeural networks and optimization
Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional
More informationarxiv: v1 [cs.lg] 11 May 2015
Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study Jakub M. Tomczak JAKUB.TOMCZAK@PWR.EDU.PL Wrocław University of Technology, wybrzeże Wyspiańskiego 7, 5-37,
More informationCSE 591: Introduction to Deep Learning in Visual Computing. - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan
CSE 591: Introduction to Deep Learning in Visual Computing - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan Overview Background Why another network structure? Vanishing and exploding
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationTraining Neural Networks Practical Issues
Training Neural Networks Practical Issues M. Soleymani Sharif University of Technology Fall 2017 Most slides have been adapted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017, and some from
More informationDeep Learning & Artificial Intelligence WS 2018/2019
Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target
More informationLarge Vocabulary Continuous Speech Recognition with Long Short-Term Recurrent Networks
Large Vocabulary Continuous Speech Recognition with Long Short-Term Recurrent Networks Haşim Sak, Andrew Senior, Oriol Vinyals, Georg Heigold, Erik McDermott, Rajat Monga, Mark Mao, Françoise Beaufays
More informationModelling Time Series with Neural Networks. Volker Tresp Summer 2017
Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,
More informationDeep learning for Natural Language Processing and Machine Translation
Deep learning for Natural Language Processing and Machine Translation 2015.10.16 Seung-Hoon Na Contents Introduction: Neural network, deep learning Deep learning for Natural language processing Neural
More informationRecurrent and Recursive Networks
Neural Networks with Applications to Vision and Language Recurrent and Recursive Networks Marco Kuhlmann Introduction Applications of sequence modelling Map unsegmented connected handwriting to strings.
More informationMaxout Networks. Hien Quoc Dang
Maxout Networks Hien Quoc Dang Outline Introduction Maxout Networks Description A Universal Approximator & Proof Experiments with Maxout Why does Maxout work? Conclusion 10/12/13 Hien Quoc Dang Machine
More informationLecture 3 Feedforward Networks and Backpropagation
Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression
More informationNatural Language Processing and Recurrent Neural Networks
Natural Language Processing and Recurrent Neural Networks Pranay Tarafdar October 19 th, 2018 Outline Introduction to NLP Word2vec RNN GRU LSTM Demo What is NLP? Natural Language? : Huge amount of information
More informationCOMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE-
Workshop track - ICLR COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- CURRENT NEURAL NETWORKS Daniel Fojo, Víctor Campos, Xavier Giró-i-Nieto Universitat Politècnica de Catalunya, Barcelona Supercomputing
More informationNeural Network Tutorial & Application in Nuclear Physics. Weiguang Jiang ( 蒋炜光 ) UTK / ORNL
Neural Network Tutorial & Application in Nuclear Physics Weiguang Jiang ( 蒋炜光 ) UTK / ORNL Machine Learning Logistic Regression Gaussian Processes Neural Network Support vector machine Random Forest Genetic
More informationAnalysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function
Analysis of the Learning Process of a Recurrent Neural Network on the Last k-bit Parity Function Austin Wang Adviser: Xiuyuan Cheng May 4, 2017 1 Abstract This study analyzes how simple recurrent neural
More informationNormalization Techniques
Normalization Techniques Devansh Arpit Normalization Techniques 1 / 39 Table of Contents 1 Introduction 2 Motivation 3 Batch Normalization 4 Normalization Propagation 5 Weight Normalization 6 Layer Normalization
More informationDeep Learning for NLP
Deep Learning for NLP CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio) Machine Learning and NLP NER WordNet Usually machine learning
More informationLecture 14. Advanced Neural Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom
Lecture 14 Advanced Neural Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com
More informationtext classification 3: neural networks
text classification 3: neural networks CS 585, Fall 2018 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs585/ Mohit Iyyer College of Information and Computer Sciences University
More informationDEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo
More information11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1)
11/3/15 Machine Learning and NLP Deep Learning for NLP Usually machine learning works well because of human-designed representations and input features CS224N WordNet SRL Parser Machine learning becomes
More informationARTIFICIAL INTELLIGENCE. Artificial Neural Networks
INFOB2KI 2017-2018 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Artificial Neural Networks Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
More informationStephen Scott.
1 / 35 (Adapted from Vinod Variyam and Ian Goodfellow) sscott@cse.unl.edu 2 / 35 All our architectures so far work on fixed-sized inputs neural networks work on sequences of inputs E.g., text, biological
More informationRecurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST
1 Recurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST Summary We have shown: Now First order optimization methods: GD (BP), SGD, Nesterov, Adagrad, ADAM, RMSPROP, etc. Second
More informationDeep Learning & Neural Networks Lecture 4
Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly
More informationConvolutional Neural Networks II. Slides from Dr. Vlad Morariu
Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson
More informationSpatial Transformer. Ref: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, Spatial Transformer Networks, NIPS, 2015
Spatial Transormer Re: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, Spatial Transormer Networks, NIPS, 2015 Spatial Transormer Layer CNN is not invariant to scaling and rotation
More information