Deep Learning Tutorial 李宏毅 Hung-yi Lee
Outline Part I: Introduction of Deep Learning Part II: Why Deep? Part III: Tips for Training Deep Neural Network Part IV: Neural Network with Memory
Part I: Introduction of Deep Learning What people already knew in 1980s
Neural Network Input x 1 x 2 Layer 1 Layer 2 Layer L Output y 1 y 2 x N y M Input Layer Hidden Layers Output Layer Deep means many hidden layers
Single Neuron z a 1 w 1 f: R K R a 2 w 2 a K w K b a 1 w 1 a 2 w 2 w K z z y a K weights b bias Activation function
Example of Neural Network x 1 1 1-2 4 1 0.98 2-1 3-1 0-2 y 1 x2 2-1 1-1 -2 0 0.12-1 -2 0 4-1 2 y Sigmoid Function z 1 1 e z z z
Example of Neural Network x 1 1 1-2 4 1 0.98 2-1 0 0.86 3-1 -2 0.62 y 1 x2 2-1 1-1 -2 0 0.12-1 -2 0 0.11 4-1 2 0.83 y f: R 2 R 2 f 1 1 = 0.62 0.83
Example of Neural Network x 1 0 1-2 1 0.73 2-1 0 0.72 3-1 -2 0.51 y 1 x2 2 0 1-1 0 0.5-1 -2 0 0.12 4-1 2 0.85 y f: R 2 R 2 f 1 1 = 0.62 0.83 f 0 0 = 0.51 0.85 Different parameters define different function f
Matrix Operation x 1 1 1-2 4 1 0.98 y 1 x2 2-1 1-1 -2 0 0.12 a W x b 0.98 1 2 1 1 = σ + 0.12 1 1 1 0 Using parallel computing techniques to speed up matrix operation y
Fully Connected Feedforward Neural Network x 1 x 2 y 1 y 2 x N 1 1 1 W x b a 1 1 2 W, b W 2, b x 1 a 2 1 2 2 W a b a W L, 2 a b L L a y M y L L-1 L L W a b a y
Handwriting Digit Recognition Machine 2 Hello World for Deep Learning
Handwriting Digit Recognition x 1 y 1 Input is 1 x 2 y 2 Input is 2 16 x 16 = 256 Ink 1 No ink 1 x 256 Input object should always be represented as a fixed-length vector. y 10 Input is 0 Each output represents the confidence of a digit.
Handwriting Digit Recognition x 1 y 1 Input is 1 x 2 y 2 Input is 2 16 x 16 = 256 Ink 1 No ink 1 x 256 y 10 Input is 0 Hopefully, the function network can Input: y 1 has the maximum value Input: y 2 has the maximum value
Handwriting Digit Recognition x 1 y 1 0.8 Input is 1 x 2 y 2 0.9 Input is 2 16 x 16 = 256 Ink 1 No ink 1 x 256 y 10 In general, the output of network does not sum to one May not be easy to interpret 0.99 Input is 0
Softmax Softmax layer as the output layer Ordinary Output layer z 1 y 1 z 1 z 2 y 2 z 2 z 3 y 3 z 3
Softmax Softmax layer as the output layer Softmax Layer Probability: 1 > y i > 0 i y i = 1 z 1 z 2 z 3 3 e 1 e 2.7-3 e e z 1 20 e z 2 e z 3 3 j 1 0.05 z j e 0.88 y 0.12 y 0 y 1 2 3 e e e z z z 1 2 3 3 j 1 3 j 1 3 j 1 e e e z z z j j j
Handwriting Digit Recognition x 1 y 1 0.1 Input is 1 16 x 16 = 256 Ink 1 No ink 1 x 2 x 256 Hopefully, the function network can Input: Input: Softmax How to let the neural network achieve this y 2 0.7 y0.2 10 Input is 2 Input is 0 y 1 has the maximum value y 2 has the maximum value
Training Data Preparing training data: images and their labels 5 0 4 1 9 2 1 3 Find the network parameters from the training data
Cost a training data 1 x 1 y 1 0.2 0 x 2 x 256 Softmax y 2 0.3 y 10 0.5 Cost C 1 0 C: the difference between the output of network and its target C can be Euclidean distance, cross entropy, etc.
Cost For all training data x 1 x 2 x R NN NN NN y 1 y 1 C 1 y 2 y 2 C 2 x 3 NN y 3 y 3 C 3 y R y R C R Total Cost: C = R r=1 C r Find the network parameters that can minimize this value
Gradient Descent Assume there are only two parameters w 1 and w 2 in a network. The colors represent the value of C. Randomly pick a starting point θ 0 w 2 Compute the negative gradient at θ 0 C θ 0 θ 0 C θ 0 Times the learning rate η C θ 0 w 1
Gradient Descent w 2 η C θ 1 θ 2 η C θ 2 C θ 2 η C θ 0 θ 0 Eventually, we would reach a minima.. θ1 θ1 C C θ 0 Randomly pick a starting point θ 0 Compute the negative gradient at θ 0 C θ 0 Times the learning rate η η C θ 0 w 1
Local Minima Gradient descent never guarantee global minima Different initial point θ 0 C Reach different minima, so different results w 1 w 2
Besides local minima cost Very slow at the plateau Stuck at saddle point Stuck at local minima C θ 0 C θ = 0 parameter space C θ = 0
In physical world Momentum How about put this phenomenon in gradient descent?
Momentum cost Still not guarantee reaching global minima, but give some hope Real Movement = Negative of Gradient + Momentum Negative of Gradient Momentum Real Movement Gradient = 0
Mini-batch Mini-batch Mini-batch Faster Better! x 1 NN y 1 y 1 C 1 x 31 NN y 31 y 31 x 2 NN C 31 y 2 y 2 C 2 x 16 NN y 16 y 16 C 16 Randomly initialize θ 0 Pick the 1 st batch C = C 1 + C 31 + θ 1 θ 0 η C θ 0 Pick the 2 nd batch C = C 2 + C 16 + θ 2 θ 1 η C θ 1 Until all mini-batches have been picked one epoch Repeat above process
Backpropagation A network can have millions of parameters. Backpropagation is the way to compute the gradients efficiently (not today) Many toolkits can compute the gradients automatically Just a function
Part II: Why Deep?
Universality Theorem Any continuous function f f : R N R M Can be realized by a network with one hidden layer (given enough hidden neurons) Reference for the reason: http://neuralnetworksandde eplearning.com/chap4.html Why Deep neural network not Fat neural network?
Deeper is Better? Word error rate (WER) Multiple layers Not surprised, more parameters, better performance Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Fat + Short v.s. Thin + Tall The same number of parameters Which one is better? x1 x 2 x N Shallow x1 x 2 x N Deep
Fat + Short v.s. Thin + Tall Word error rate (WER) Multiple layers 1 hidden layer Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Why Deep? Consider Logic Circuits A two levels of basic logic gates can represent any Boolean function. However, no one uses two levels of logic gates to build computers Using multiple layers of logic gates to build some functions are much simpler (less gates needed).
Why Deep? Modulation Classifier 1 Girls with glasses Image Classifier 2 Boys with glasses Classifier 3 Boys wearing blue shirt
Why Deep? Modulation Boy or Girl? simper Classifier 1 Girls with glasses Image With glasses? Classifier 2 Boys with glasses Little Data wearing blue shirt? Classifier 3 Boys wearing blue shirt Multiple Layers
Hand-crafted kernel function SVM Deep Learning x 1 Apply simple classifier Source of image: http://www.gipsa-lab.grenobleinp.fr/transfert/seminaire/455_kadri2013gipsa-lab.pdf Learnable kernel φ x simple classifier y 1 x x 2 y 2 x N y M
Boosting Weak classifier Input x Weak classifier Combine Weak classifier Deep Learning Weak classifier Boosted weak classifier Boosted Boosted weak classifier x 1 x 2 x N
Hard to get the power of Deep Before 2006, deeper usually does not imply better.
Part III: Tips for Training DNN
Recipe for Learning http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learningexplained-in-a-single-powerpoint-slide/
Recipe for Learning Don t forget! Modify the Network Better optimization Strategy Preventing Overfitting overfitting http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learningexplained-in-a-single-powerpoint-slide/
Recipe for Learning Modify the Network New activation function, for example, ReLU or Maxout Better optimization Strategy Adaptive learning rate Prevent Overfitting Dropout Only use this approach when you already obtained good results on the training data.
Part III: Tips for Training DNN New Activation Function
ReLU Rectified Linear Unit (ReLU) a = 0 σ z a a = z z Reason: 1. Fast to compute 2. Biological reason 3. Infinite sigmoid with different biases [Xavier Glorot, AISTATS 11] 4. Vanishing gradient problem
Vanishing Gradient Problem x 1 x 2 y 1 y 2 x N y M Smaller gradients Learn very slow Almost random Larger gradients Learn very fast Already converge based on random!?
ReLU 0 x 1 x 2 0 0 y 1 y 2 0
ReLU A Thinner linear network x 1 y 1 x 2 y 2 Do not have smaller gradients
ReLU - variant Leaky ReLU [Andrew L. Maas, ICML 13] a a = z Parametric ReLU [Kaiming He, arxiv 15] a a = z a = 0.01z z a = αz z α also learned by gradient descent
Maxout All ReLU variants are just special cases of Maxout Learnable activation function [Ian J. Goodfellow, ICML 13] Input x 1 + 5 + 7 Max max z 1 1, z 2 1 7 + 1 + 2 Max 2 x 2 + 1 + 4 + 1 Max 1 + 3 Max 4 You can have more than 2 elements in a group.
Maxout ReLU is special case Input x 1 w b z ReLU a Input x 1 0 w 0 b + z 1 Max a + z 2 max z 1, z 2 a z = wx + b x a z 1 = wx + b x z 2 =0
Maxout ReLU is special case Input x 1 w b z ReLU a a Input w x w b 1 z = wx + b x b + z 1 Max a + z 2 max z 1, z 2 Learnable Activation Function a z 1 = wx + b x z 2 = w x + b
Part III: Tips for Training DNN Adaptive Learning Rate
Learning Rate Set the learning rate η carefully η C θ 0 If learning rate is too large w 2 Cost may not decrease after each update C θ 0 θ 0 w 1
Learning Rate Can we give different parameters Set the learning different learning rate η carefully rate? If learning rate is too large w 2 η C θ 0 θ 0 C θ 0 Cost may not decrease after each update If learning rate is too small Training would be too slow w 1
Adagrad Divide the learning rate of each parameter w by a parameter dependent value σ w Original Gradient descent w t+1 w t gߟ t Adagrad ߟ t w t+1 w σ w g t Parameter dependent learning rate g t = C θt w σt is parameter dependent σ t = t i=0 g i 2
Adagrad w t+1 w t η t i=0 g i 2 g t g 0 g 1 0.1 0.2 η 0.1 2 η 0.1 2 + 0.2 2 = η 0.1 = η 0.22 g 0 g 1 20.0 10.0 Observation: 1. Learning rate is smaller and smaller for all parameters 2. Smaller gradient, larger learning rate, and vice versa η 20 2 η 20 2 + 10 2 = η 20 = η 22 Why?
Reason of Adagrad Smaller gradient, larger learning rate, and vice versa Smaller Learning Rate w 2 Larger Learning Rate w 1
Not the whole story Adagrad RMSprop Adadelta Adam AdaSecant No more pesky learning rates
Part III: Tips for Training DNN Dropout
Dropout Pick a mini-batch θ t θ t 1 η C θ t 1 Training: Each time before computing the gradients Each neuron has p% to dropout
Pick a mini-batch Dropout θ t θ t 1 η C θ t 1 Training: Thinner! Each time before computing the gradients Each neuron has p% to dropout The structure of the network is changed. Using the new network for training For each mini-batch, we resample the dropout neurons
Dropout Testing: No dropout If the dropout rate at training is p%, all the weights times (1-p)% Assume that the dropout rate is 50%. If a weight w = 1 by training, set w = 0.5 for testing.
Dropout - Intuitive Reason 我的 partner 會擺爛, 所以我要好好做 When teams up, if everyone expect the partner will do the work, nothing will be done finally. However, if you know your partner will dropout, you will do better. When testing, no one dropout actually, so obtaining good results eventually.
Dropout - Intuitive Reason Why the weights should multiply (1-p)% (dropout rate) when testing? Training of Dropout Testing of Dropout Assume dropout rate is 50% w 1 w 2 w 3 w 4 z No dropout 0.5 0.5 0.5 0.5 w 1 w 2 w 3 w 4 Weights from training z 2z z Weights multiply (1-p)% z z
Dropout is a kind of ensemble. Ensemble Training Set Set 1 Set 2 Set 3 Set 4 Network 1 Network 2 Network 3 Network 4 Train a bunch of networks with different structures
Dropout is a kind of ensemble. Ensemble Testing data x Network 1 Network 2 Network 3 Network 4 y 1 y 2 y 3 y 4 average
Dropout is a kind of ensemble. minibatch 1 minibatch 2 minibatch 3 minibatch 4 Training of Dropout M neurons 2 M possible networks Using one mini-batch to train one network Some parameters in the network are shared
Dropout is a kind of ensemble. Testing of Dropout testing data x All the weights multiply (1-p)% y 1 y 2 y 3 average y
More about dropout Dropout works better with Maxout [Ian J. Goodfellow, ICML 13] More reference for dropout [Nitish Srivastava, JMLR 14] [Pierre Baldi, NIPS 13][Geoffrey E. Hinton, arxiv 12] Dropconnect [Li Wan, ICML 13] Dropout delete neurons Dropconnect deletes the connection between neurons Annealed dropout [S.J. Rennie, SLT 14] Dropout rate decreases by epochs Standout [J. Ba, NISP 13] Each neural has different dropout rate
Part IV: Neural Network with Memory
Neural Network needs Memory Input and output are sequences E.g. Name Entity Recognition (NER) NOT Detecting named entities like name of people, locations, organization, etc. in a sentence. NOT NOT ORG NOT NOT NOT y 1 y 2 y 3 y 4 y 5 y 6 y 7 DNN DNN DNN DNN DNN DNN DNN x 1 x 2 x 3 x 4 x 5 x 6 x 7 the president of apple eats an apple
Neural Network needs Memory Input and output are sequences E.g. Name Entity Recognition (NER) NOT DNN needs memory! NOT NOT ORG NOT NOT NOT y 1 y 2 y 3 y 4 y 5 y 6 y 7 DNN DNN DNN DNN DNN DNN DNN x 1 x 2 x 3 x 4 x 5 x 6 x 7 the president of apple eats an apple
Recurrent Neural Network (RNN) y 1 y 2 The values are stored by RNN in the previous step. a1 a2 Memory can be considered as another input. x1 x2
RNN y 1 RNN input: x 1 x 2 x 3 x N y 1 =softmax(w o a 1 ) memory 0 a 1 copy W h W o a 1 =σ(w i x 1 +W h 0) W i x 1
RNN y 1 y 2 RNN input: x 1 x 2 x 3 x N y 2 =softmax(w o a 2 ) memory 0 a 1 a 2 copy W h W o a 2 =σ(w i x 2 +W h a 1 ) W i x 2
RNN y 1 y 2 y 3 RNN input: x 1 x 2 x 3 x N y 3 =softmax(w o a 3 ) memory 0 a 1 a 2 a 3 copy W h W o a 3 =σ(w i x 3 +W h a 2 ) W i x 3
RNN y 1 y 2 y 3 init 0 W o copy copy W a o 1 a 2 a 3 a 1 a 2 W i W o W h W h W h W i x 1 x 2 x 3 W i The same network is used again and again. Output y i depends on x 1, x 2, x i
RNN y 1 y 2 y 3 init 0 W h W o W h W o W h W o W i W i x 1 x 2 x 3 W i The same network is used again and again. Output y i depends on x 1, x 2, x i
Training y 1 y 2 y 3 y 1 y 2 y 3 init 0 W h W o W h W o W h W o W i W i x 1 x 2 x 3 W i Backpropagation through time (BPTT)
Of course it can be deep y t y t+1 y t+2 x t x t+1 x t+2
Bidirectional RNN x t x t+1 x t+2 y t y t+1 y t+2 x t x t+1 x t+2
Many to Many (Output is shorter) Both input and output are both sequences, but the output is shorter. E.g. Speech Recognition Problem? Why can t it be 好棒棒 Output: Input: 好棒 好好好 (character sequence) Trimming 棒棒棒棒棒 (vector sequence)
Many to Many (Output is shorter) Both input and output are both sequences, but the output is shorter. Connectionist Temporal Classification (CTC) [Alex Graves, ICML 06][Alex Graves, ICML 14][Haşim Sak, Interspeech 15][Jie Li, Interspeech 15][Andrew Senior, ASRU 15] 好棒 好棒棒 Add an extra symbol φ representing null 好 φ φ 棒 φ φ φ φ 好 φ φ 棒 φ 棒 φ φ
Many to Many (No Limitation) Both input and output are both sequences with different lengths. Sequence to sequence learning E.g. Machine Translation (machine learning 機器學習 ) machine learning Containing all information about input sequence
Many to Many (No Limitation) Both input and output are both sequences with different lengths. Sequence to sequence learning E.g. Machine Translation (machine learning 機器學習 ) 機 器 學 習 慣 性 machine learning Don t know when to stop
Many to Many (No Limitation) 推 tlkagk: ========= 斷 ========== Ref:http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D% E6%8E%A8%E6%96%87 ( 鄉民百科 )
Many to Many (No Limitation) Both input and output are both sequences with different lengths. Sequence to sequence learning E.g. Machine Translation (machine learning 機器學習 ) 機 器 學 習 === machine learning Add a symbol === ( 斷 ) [Ilya Sutskever, NIPS 14][Dzmitry Bahdanau, arxiv 15]
The error surface is rough. The error surface is either very flat or very steep. Clipping Cost w 2 w 1 [Razvan Pascanu, ICML 13]
Toy Example w = 1 w = 1.01 y 1000 = 1 y 1000 20000 Large gradient Small Learning rate? w = 0.99 w = 0.01 y 1000 0 y 1000 0 small gradient Large Learning rate? y 1 y 2 y 3 y 1000 1 1 1 1 w w w 1 1 1 1 1 0 0 0
Helpful Techniques NAG: Advance momentum method RMS Prop Advanced Adagrad Second derivative change Long Short-term Memory (LSTM) Can deal with the gradient vanishing problem
Long Short-term Memory (LSTM) Signal control the output gate (Other part of the network) Signal control the input gate (Other part of the network) Other part of the network Output Gate Memory Cell Input Gate Other part of the network Forget Gate LSTM Special Neuron: 4 inputs, 1 output Signal control the forget gate (Other part of the network)
a = h c f z o z o f z o multiply h c Activation function f is usually a sigmoid function Between 0 and 1 c f z f Mimic open and close gate c c z f cf z f z i f z i g z f z i multiply g z c = g z f z i + cf z f z
Original Network: Simply replace the neurons with LSTM a 1 a 2 z 1 z 2 x 1 x 2 Input
+ a 1 a 2 + + + + + + + 4 times of parameters x 1 x 2 Input
LSTM Extension: peephole y t y t+1 c t 1 c t c t+1 + + z f z i z z o z f z i z z o c t h t x t c t+1 h t+1 x t+1
Other Simpler Alternatives Gated Recurrent Unit (GRU) Structurally Constrained Recurrent Network (SCRN) [Cho. EMNLP 14] [Tomas Mikolov, ICLR 15] Vanilla RNN Initialized with Identity matrix + ReLU activation function [Quoc V. Le, arxiv 15] Outperform or be comparable with LSTM in four different tasks
What is the next wave? Attention-based Model Reading Head Reading Head Controller Writing Head Memory Writing Head Controller Input x DNN output y
Concluding Remarks
Concluding Remarks Introduction of deep learning Discussing some reasons using deep learning New techniques for deep learning ReLU, Maxout Giving all the parameters different learning rates Dropout Network with memory Recurrent neural network Long short-term memory (LSTM)
Thank you for your attention!