Deep Learning Tutorial 李宏毅 Hung-yi Lee
Deep learning attracts lots of attention (see Google Trends). Deep learning obtains many exciting results (see the other talks this afternoon). This talk will focus on the technical part.
Outline Part I: Introduction of Deep Learning Part II: Why Deep? Part III: Tips for Training Deep Neural Network Part IV: Neural Network with Memory
Part I: Introduction of Deep Learning What people already knew in 1980s
Example Application: Handwriting Digit Recognition. The machine takes an image of a handwritten digit and outputs the digit, e.g. "2".
Handwriting Digit Recognition. Input: the image as a 16 x 16 = 256-dimensional vector x (ink = 1, no ink = 0). Output: a 10-dimensional vector y, where each dimension represents the confidence that the image is a particular digit. For an image of "2", for example, y1 = 0.1 (is 1), y2 = 0.7 (is 2), ..., y10 = 0.2 (is 0).
Example Application: Handwriting Digit Recognition. The machine is a function f: R^256 → R^10. In deep learning, the function f is represented by a neural network.
Element of Neural Network. A neuron is a function f: R^K → R: z = a1 w1 + a2 w2 + ... + aK wK + b and a = σ(z), where a1, ..., aK are the inputs, w1, ..., wK are the weights, b is the bias, and σ is the activation function.
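A minimal NumPy sketch of a single neuron; the inputs, weights, and bias below are the toy numbers used in the following example slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    z = np.dot(w, a) + b      # z = a1*w1 + ... + aK*wK + b
    return sigmoid(z)         # the activation function turns z into the output a

a = np.array([1.0, -1.0])     # inputs
w = np.array([1.0, -2.0])     # weights
b = 1.0                       # bias
print(neuron(a, w, b))        # about 0.98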
Neural Network. Neurons are arranged in layers: the input layer (x1, ..., xN), hidden layers (Layer 1, Layer 2, ..., Layer L), and the output layer (y1, ..., yM). "Deep" means many hidden layers.
Example of Neural Network. With inputs (1, -1): a neuron with weights (1, -2) and bias 1 outputs σ(4) = 0.98, and a neuron with weights (-1, 1) and bias 0 outputs σ(-2) = 0.12, using the sigmoid function σ(z) = 1 / (1 + e^(-z)).
Example of Neural Network. Feeding (1, -1) through all three layers: the first layer outputs (0.98, 0.12), the second layer outputs (0.86, 0.11), and the third layer outputs (0.62, 0.83).
Example of Neural Network. The same network maps (0, 0) to (0.51, 0.85): the first layer outputs (0.73, 0.5), the second (0.72, 0.12), the third (0.51, 0.85). So the network is a function f: R^2 → R^2 with f(1, -1) = (0.62, 0.83) and f(0, 0) = (0.51, 0.85). Different parameters define different functions.
Matrix Operation. The first layer can be written as a matrix operation: σ( [1 -2; -1 1] [1; -1] + [1; 0] ) = σ( [4; -2] ) = [0.98; 0.12].
Neural Network. Writing the weights and biases of layer l as W^l and b^l: a^1 = σ(W^1 x + b^1), a^2 = σ(W^2 a^1 + b^2), ..., y = σ(W^L a^(L-1) + b^L).
Neural Network. The whole network is a single function y = f(x) = σ( W^L ... σ( W^2 σ( W^1 x + b^1 ) + b^2 ) ... + b^L ). Parallel computing techniques can be used to speed up the matrix operations.
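A minimal NumPy sketch of the whole forward pass y = f(x) as a loop over layers; the weights and biases are the toy numbers read off the example network above, so the result should match f(1, -1) ≈ (0.62, 0.83).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # y = sigma( W^L ... sigma( W^2 sigma( W^1 x + b^1 ) + b^2 ) ... + b^L )
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

weights = [np.array([[ 1.0, -2.0], [-1.0,  1.0]]),
           np.array([[ 2.0, -1.0], [-2.0, -1.0]]),
           np.array([[ 3.0, -1.0], [-1.0,  4.0]])]
biases  = [np.array([1.0, 0.0]), np.array([0.0, 0.0]), np.array([-2.0, 2.0])]

print(forward(np.array([1.0, -1.0]), weights, biases))   # about [0.62, 0.83]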
Softmax. Softmax layer as the output layer. With an ordinary layer, y1 = σ(z1), y2 = σ(z2), y3 = σ(z3); in general the output of the network can be any value, which may not be easy to interpret.
Softmax. With a softmax layer, y_i = e^(z_i) / Σ_j e^(z_j), so the outputs behave like probabilities: 1 > y_i > 0 and Σ_i y_i = 1. Example: z = (3, 1, -3) gives e^z ≈ (20, 2.7, 0.05) and y ≈ (0.88, 0.12, 0).
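A minimal NumPy sketch of the softmax layer; subtracting the maximum before exponentiating is a standard trick for numerical stability and does not change the result.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # e^(z_i), shifted for numerical stability
    return e / np.sum(e)        # y_i = e^(z_i) / sum_j e^(z_j)

print(softmax(np.array([3.0, 1.0, -3.0])))   # about [0.88, 0.12, 0.00]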
How to set network parameters. θ = {W^1, b^1, W^2, b^2, ..., W^L, b^L}. The input is the 16 x 16 = 256-dimensional image vector (ink = 1, no ink = 0) and the output layer is a softmax. Set the network parameters θ such that, given an image of "2", y2 has the maximum value, and given an image of "1", y1 has the maximum value. How do we let the neural network achieve this?
Training Data. Prepare training data: images and their labels (e.g. 5, 0, 4, 1, 9, 2, 1, 3). Use the training data to find the network parameters.
Cost. Given a set of network parameters θ, each example has a cost value. For an image of "1" with target (1, 0, ..., 0), the network output, e.g. (0.2, 0.3, ..., 0.5), is compared with the target to give the cost L(θ). The cost can be the Euclidean distance or the cross entropy between the network output and the target.
Total Cost. For all R training examples x^1, x^2, ..., x^R, each example r gives a cost L_r(θ). Total cost: C(θ) = Σ_{r=1}^{R} L_r(θ), which measures how bad the network parameters θ are on this task. Find the network parameters θ that minimize this value.
Gradient Descent. Assume there are only two parameters w1 and w2 in the network, θ = {w1, w2}, and look at the error surface, where the colors represent the value of C. Randomly pick a starting point θ^0, compute the gradient at θ^0, ∇C(θ^0) = [∂C(θ^0)/∂w1, ∂C(θ^0)/∂w2]^T, and move in the direction of the negative gradient, scaled by the learning rate η: θ^1 = θ^0 - η ∇C(θ^0).
Gradient Descent. Repeat the same step from the new point: compute the negative gradient at θ^1, multiply by the learning rate η, and update θ^2 = θ^1 - η ∇C(θ^1), and so on. Eventually, we reach a minimum.
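A minimal sketch of the update loop on a made-up two-parameter cost (C = w1^2 + 4*w2^2 is an assumption for illustration, not a cost from the slides).

import numpy as np

def grad_C(theta):
    # gradient of the assumed toy cost C = w1^2 + 4*w2^2
    return np.array([2 * theta[0], 8 * theta[1]])

eta = 0.1                       # learning rate
theta = np.array([2.0, -1.0])   # randomly picked starting point theta^0
for t in range(100):
    theta = theta - eta * grad_C(theta)   # theta^(t+1) = theta^t - eta * grad C(theta^t)
print(theta)                    # close to the minimum at (0, 0)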
Local Minima. Gradient descent never guarantees the global minimum. Different initial points θ^0 reach different minima, so different results. (Who is Afraid of Non-Convex Loss Functions? http://videolectures.net/eml07_lecun_wia/)
Besides local minima. Training can also be very slow at a plateau, where ∇C(θ) ≈ 0, or stuck at a saddle point or a local minimum, where ∇C(θ) = 0.
In the physical world, a ball rolling down the cost surface carries momentum. How about putting this phenomenon into gradient descent?
Momentum. Movement = negative of gradient + momentum (the previous movement). Even where the gradient = 0, the momentum keeps the parameters moving, so the real movement can carry past plateaus and small local minima. This still does not guarantee reaching the global minimum, but it gives some hope.
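A minimal sketch of gradient descent with momentum, reusing the same made-up toy cost; the momentum coefficient 0.9 is an assumed, commonly used value.

import numpy as np

def grad_C(theta):
    # gradient of the assumed toy cost C = w1^2 + 4*w2^2
    return np.array([2 * theta[0], 8 * theta[1]])

eta, mu = 0.1, 0.9                  # learning rate and momentum coefficient
theta = np.array([2.0, -1.0])
movement = np.zeros_like(theta)     # the previous movement
for t in range(200):
    movement = mu * movement - eta * grad_C(theta)   # momentum + negative of gradient
    theta = theta + movement                         # the real movement
print(theta)                        # close to the minimum at (0, 0)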
Mini-batch. Randomly initialize θ^0. Pick the 1st mini-batch (e.g. examples x^1, x^31, ...), compute C = L^1 + L^31 + ... on that batch, and update θ^1 ← θ^0 - η ∇C(θ^0). Pick the 2nd mini-batch (e.g. x^2, x^16, ...), compute C = L^2 + L^16 + ..., and update θ^2 ← θ^1 - η ∇C(θ^1). C is different each time we update the parameters!
Mini-batch. Compared with the original gradient descent, the update path with mini-batches looks unstable when plotted on the error surface (the colors represent the total C on all training data).
Mini-batch. In practice, mini-batches are faster and give better results. Randomly initialize θ^0; pick the 1st mini-batch, C = C^1 + C^31 + ..., θ^1 ← θ^0 - η ∇C(θ^0); pick the 2nd mini-batch, C = C^2 + C^16 + ..., θ^2 ← θ^1 - η ∇C(θ^1); continue until all mini-batches have been picked, which is one epoch; then repeat the above process.
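A minimal sketch of the mini-batch loop; grad_L(theta, x, y_hat) is an assumed helper that returns the gradient of one example's cost and is not defined on the slides.

import random

def train(theta, data, grad_L, eta=0.1, batch_size=16, epochs=10):
    # data is a list of (x, y_hat) training pairs
    for epoch in range(epochs):
        random.shuffle(data)                        # pick the mini-batches randomly
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]          # one mini-batch
            g = sum(grad_L(theta, x, y_hat) for x, y_hat in batch)   # gradient of C = sum of L over the batch
            theta = theta - eta * g                 # one update per mini-batch
        # all mini-batches picked once = one epoch; then repeat
    return theta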
Backpropagation. A network can have millions of parameters. Backpropagation is the way to compute the gradients efficiently (not today). Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/mlds_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html Many toolkits can compute the gradients automatically. Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/mlds_2015_2/lecture/theano%20dnn.ecm.mp4/index.html
Part II: Why Deep?
Deeper is Better?

Layer x Size   Word Error Rate (%)      Layer x Size   Word Error Rate (%)
1 x 2k         24.2                     1 x 3772       22.5
2 x 2k         20.4                     1 x 4634       22.6
3 x 2k         18.4                     1 x 16k        22.1
4 x 2k         17.8
5 x 2k         17.2
7 x 2k         17.1

Not surprising: more parameters, better performance.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem. Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons). Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html So why a deep neural network rather than a fat neural network?
Fat + Short v.s. Thin + Tall. With the same number of parameters, which one is better: a shallow (fat + short) network or a deep (thin + tall) network?
Fat + Short v.s. Thin + Tall.

Layer x Size   Word Error Rate (%)      Layer x Size   Word Error Rate (%)
1 x 2k         24.2                     1 x 3772       22.5
2 x 2k         20.4                     1 x 4634       22.6
3 x 2k         18.4                     1 x 16k        22.1
4 x 2k         17.8
5 x 2k         17.2
7 x 2k         17.1

With a comparable number of parameters, the deep networks (left column) clearly outperform the shallow, wide ones (right column).
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Why Deep? Deep → Modularization. Suppose we directly train four image classifiers: girls with long hair, boys with long hair, girls with short hair, and boys with short hair. The classifier for boys with long hair is weak because there are only a few training examples of them.
Why Deep? Deep → Modularization. Instead, first train basic classifiers for the attributes: boy or girl? long or short hair? By grouping the images this way (long-haired vs. short-haired, boys vs. girls), each basic classifier can have sufficient training examples.
Why Deep? Deep → Modularization. The basic classifiers are shared by the following classifiers as modules. Each of the four fine classifiers (girls with long hair, boys with long hair, girls with short hair, boys with short hair) can then be trained with little data, even boys with long hair.
Why Deep? Deep → Modularization. In a deep network the modularization is automatically learned from data: the first layer learns the most basic classifiers, the second layer uses the first layer as modules to build more complex classifiers, the next layer uses the second layer as modules, and so on. Because modules are shared, less training data is needed; this is also why deep learning works on small data sets like TIMIT.
SVM: a hand-crafted kernel function followed by a simple classifier. Deep learning: a learnable kernel, where the hidden layers learn the feature transform φ(x) and the output layer applies a simple classifier to it. (Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_kadri2013gipsa-lab.pdf)
Hard to get the power of Deep: before 2006, deeper usually did not imply better.
Part III: Tips for Training DNN
Recipe for Learning. http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Recipe for Learning. Don't forget to check the performance on the training data first; if it is poor, modify the network or use a better optimization strategy. Only then worry about overfitting on the testing data. http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Recipe for Learning. Modify the network: new activation functions, for example ReLU or Maxout. Better optimization strategy: adaptive learning rates. Prevent overfitting: dropout. Only use the overfitting remedies when you have already obtained good results on the training data.
Part III: Tips for Training DNN New Activation Function
ReLU. Rectified Linear Unit (ReLU): a = z if z > 0, and a = 0 if z ≤ 0. [Xavier Glorot, AISTATS 11] [Andrew L. Maas, ICML 13] [Kaiming He, arXiv 15] Reasons: 1. Fast to compute. 2. Biological reason. 3. It behaves like an infinite number of sigmoids with different biases. 4. It helps with the vanishing gradient problem.
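A minimal NumPy sketch of the ReLU activation.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)     # a = z for z > 0, a = 0 otherwise

print(relu(np.array([4.0, -2.0])))   # [4. 0.]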
Vanishing Gradient Problem. In a deep sigmoid network, the layers near the input have smaller gradients, learn very slowly, and stay almost random, while the layers near the output have larger gradients, learn very fast, and converge quickly, based on essentially random features from below!? In 2006, people worked around this with RBM pre-training; in 2015, people use ReLU.
Vanishing Gradient Problem. An intuitive way to compute the gradient ∂C/∂w ≈ ΔC/Δw: perturb a weight w near the input by Δw. A large change of the input to a sigmoid produces only a small change of its output, so after passing through many layers the effect on the network output, and hence ΔC, becomes tiny. The gradients near the input are therefore small.
ReLU. With ReLU, each neuron either acts linearly (a = z) or outputs exactly 0 (a = 0). The neurons that output 0 have no effect and can be removed from the network.
ReLU. Removing the zero-output neurons leaves a thinner, linear network, which does not have the problem of smaller and smaller gradients.
Maxout. ReLU is a special case of Maxout, a learnable activation function [Ian J. Goodfellow, ICML 13]. The linear pre-activations of a layer are grouped, and each group outputs its maximum, e.g. a group with values 5 and 7 outputs max(5, 7) = 7. You can have more than 2 elements in a group.
Maxout. ReLU is a special case of Maxout [Ian J. Goodfellow, ICML 13]. The activation function in a maxout network can be any piecewise-linear convex function; how many pieces it has depends on how many elements are in a group (2 elements give 2 pieces, 3 elements give 3 pieces).
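A minimal NumPy sketch of one maxout layer with 2 elements per group; the input, weights, and biases are made-up numbers for illustration.

import numpy as np

def maxout_layer(x, W, b, group_size=2):
    z = W @ x + b                                   # one linear unit per row of W
    return z.reshape(-1, group_size).max(axis=1)    # each group outputs its maximum

x = np.array([1.0, -1.0])
W = np.array([[ 2.0, 1.0],
              [ 4.0, 3.0],
              [-1.0, 0.0],
              [ 1.0, 2.0]])                         # 4 linear units grouped in pairs
b = np.array([4.0, 6.0, 0.0, 0.0])
print(maxout_layer(x, W, b))                        # [ 7. -1.]  (max(5, 7) and max(-1, -1))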
Part III: Tips for Training DNN Adaptive Learning Rate
Learning Rate. Set the learning rate η carefully. If the learning rate is too large, the cost may not decrease after each update.
Learning Rate. If the learning rate is too small, training will be too slow. Can we give different parameters different learning rates?
Adagrad. Original gradient descent: θ^t ← θ^(t-1) - η ∇C(θ^(t-1)). In Adagrad each parameter w is considered separately: w^(t+1) ← w^t - η_w g^t, where g^t = ∂C(θ^t)/∂w and η_w = η / sqrt( Σ_{i=0}^{t} (g^i)^2 ) is a parameter-dependent learning rate based on the summation of the squares of the previous derivatives.
Adagrad. η_w = η / sqrt( Σ_{i=0}^{t} (g^i)^2 ). Example: for w1 with g^0 = 0.1 and g^1 = 0.2, the learning rates are η / 0.1 and then η / sqrt(0.1^2 + 0.2^2) = η / 0.22; for w2 with g^0 = 20.0 and g^1 = 10.0, they are η / 20 and then η / sqrt(20^2 + 10^2) = η / 22. Observations: 1. The learning rate gets smaller and smaller for all parameters. 2. Smaller derivatives lead to a larger learning rate, and vice versa. Why?
Larger derivatives get a smaller learning rate; smaller derivatives get a larger learning rate. Why? Intuitively, dividing by the accumulated gradient magnitude balances the step sizes across parameters whose derivatives have very different scales.
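A minimal sketch of the Adagrad update on the same made-up toy cost used in the earlier gradient-descent sketch; the small constant added to the denominator is only for numerical stability.

import numpy as np

def grad_C(theta):
    # gradient of the assumed toy cost C = w1^2 + 4*w2^2
    return np.array([2 * theta[0], 8 * theta[1]])

eta = 0.1
theta = np.array([2.0, -1.0])
G = np.zeros_like(theta)                    # per-parameter sum of squared derivatives
for t in range(100):
    g = grad_C(theta)
    G += g ** 2
    theta = theta - eta / (np.sqrt(G) + 1e-8) * g   # parameter-dependent learning rate eta_w
print(theta)                                # moves toward the minimum at (0, 0)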
Not the whole story. Adagrad [John Duchi, JMLR 11]; RMSprop (https://www.youtube.com/watch?v=o3sxac4hxzu); Adadelta [Matthew D. Zeiler, arXiv 12]; Adam [Diederik P. Kingma, ICLR 15]; AdaSecant [Caglar Gulcehre, arXiv 14]; "No more pesky learning rates" [Tom Schaul, arXiv 12].
Part III: Tips for Training DNN Dropout
Dropout. Training: pick a mini-batch and update θ^t ← θ^(t-1) - η ∇C(θ^(t-1)). Each time, before computing the gradients, each neuron has a p% chance to drop out.
Dropout. Training (continued): dropping neurons changes the structure of the network; it becomes thinner. Use the new, thinner network for training on the current mini-batch. For each mini-batch, we resample the dropout neurons.
Dropout. Testing: no dropout. If the dropout rate at training is p%, multiply all the weights by (1 - p)%. For example, if the dropout rate is 50% and a weight is w = 1 after training, set w = 0.5 for testing.
Dropout - Intuitive Reason. "My partner will slack off, so I have to do a good job myself." When people team up, if everyone expects the partner to do the work, nothing gets done in the end. However, if you know your partner may drop out, you will do better. When testing, no one actually drops out, so we obtain good results in the end.
Dropout - Intuitive Reason. Why should the weights be multiplied by (1 - p)% (the complement of the dropout rate) when testing? Assume the dropout rate is 50%. During training, about half of a neuron's inputs w1, w2, w3, w4 are dropped, so the trained weights produce a pre-activation of about z. At testing no dropout is applied, so the same weights would produce about 2z; multiplying the weights by (1 - p)% = 0.5 brings it back to z.
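A minimal NumPy sketch of dropout for one hidden layer; the layer shapes and the sigmoid activation are assumptions for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p = 0.5   # dropout rate, as in the 50% example above

def hidden_layer_train(x, W, b):
    mask = np.random.rand(*x.shape) >= p    # each input neuron has p% chance to drop out
    return sigmoid(W @ (x * mask) + b)      # resample the mask for every mini-batch

def hidden_layer_test(x, W, b):
    return sigmoid((1 - p) * W @ x + b)     # no dropout; the trained weights times (1 - p)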
Dropout is a kind of ensemble. Ensemble: split the training set into several subsets (Set 1, Set 2, Set 3, Set 4) and train a bunch of networks with different structures, one per subset.
Dropout is a kind of ensemble. Ensemble: at testing time, feed the testing data x into all the networks (Network 1, 2, 3, 4) and average their outputs y1, y2, y3, y4.
Dropout is a kind of ensemble. Training of dropout: with M neurons there are 2^M possible thinned networks. Each mini-batch trains one of these networks, and some parameters are shared among the networks.
Dropout is a kind of ensemble. Testing of dropout: ideally we would feed the testing data x into all the thinned networks and average their outputs y1, y2, y3, ...; multiplying all the weights by (1 - p)% approximates this average with a single network.
More about dropout. More references for dropout: [Nitish Srivastava, JMLR 14] [Pierre Baldi, NIPS 13] [Geoffrey E. Hinton, arXiv 12]. Dropout works better with Maxout [Ian J. Goodfellow, ICML 13]. Dropconnect [Li Wan, ICML 13]: dropout deletes neurons, while dropconnect deletes the connections between neurons. Annealed dropout [S.J. Rennie, SLT 14]: the dropout rate decreases over epochs. Standout [J. Ba, NIPS 13]: each neuron has its own dropout rate.
Part IV: Neural Network with Memory
Neural Network needs Memory. Named Entity Recognition: detecting named entities such as names of people, locations, organizations, etc. in a sentence. Feeding the word "apple" (as a 1-of-N vector) into a DNN gives e.g. 0.1 people, 0.1 location, 0.5 organization, 0.3 none.
Neural Network needs Memory. Named Entity Recognition: detecting named entities such as names of people, locations, organizations, etc. in a sentence. In "the president of apple eats an apple", the first "apple" should be tagged ORG and the second NONE, but a DNN applied word by word (x1 ... x7 → y1 ... y7) must give the same output for the same input word. The DNN needs memory!
Recurrent Neural Network (RNN). The outputs of the hidden layer (a1, a2) are stored in the memory and copied back; the memory can be considered as another input.
RNN. The same network is used again and again at every time step: the hidden layer takes the current input x_t (through W_i) and the memory a_(t-1) (through W_h), produces a_t, and the output layer (W_o) produces y_t. The output y_t therefore depends on x_1, x_2, ..., x_t.
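A minimal NumPy sketch of the RNN forward pass described above; using a sigmoid hidden layer and a softmax output layer is an assumption for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_forward(xs, Wi, Wh, Wo):
    # the same Wi, Wh, Wo are used again and again at every time step
    a = np.zeros(Wh.shape[0])            # the memory, initialized to zero
    ys = []
    for x in xs:                         # x1, x2, x3, ...
        a = sigmoid(Wi @ x + Wh @ a)     # new hidden values, stored back into the memory
        ys.append(softmax(Wo @ a))       # y_t depends on x_1, ..., x_t
    return ys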
RNN. How to train? Each output y_t has a target and a cost L_t. Find the network parameters that minimize the total cost using backpropagation through time (BPTT).
Of course it can be deep y t y t+1 y t+2 x t x t+1 x t+2
Bidirectional RNN. Two RNNs read the input sequence x_t, x_(t+1), x_(t+2), ... in opposite directions, and each output y_t is computed from the hidden states of both directions.
Many to Many (Output is shorter). Both input and output are sequences, but the output is shorter, e.g. speech recognition. Input: a vector sequence of acoustic frames; output: a character sequence such as 好棒 ("great"). Simply trimming repeated outputs, 好好好棒棒棒棒棒 → 好棒, has a problem: why can't the output be 好棒棒 (where the repeated character is intended)?
Many to Many (Output is shorter). Connectionist Temporal Classification (CTC) [Alex Graves, ICML 06] [Alex Graves, ICML 14] [Haşim Sak, Interspeech 15] [Jie Li, Interspeech 15] [Andrew Senior, ASRU 15]. Add an extra symbol φ representing "null": the frame-level output 好 φ φ 棒 φ φ φ φ decodes to 好棒, while 好 φ φ 棒 φ 棒 φ φ decodes to 好棒棒.
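A minimal sketch of the CTC decoding rule described above (merge repeated symbols, then remove the null symbol φ); the function name is mine.

def ctc_collapse(tokens, blank="φ"):
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:     # keep a symbol only when it starts a new run and is not null
            out.append(t)
        prev = t
    return "".join(out)

print(ctc_collapse(list("好φφ棒φφφφ")))    # 好棒
print(ctc_collapse(list("好φφ棒φ棒φφ")))   # 好棒棒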
Many to Many (No Limitation). Both input and output are sequences, with different lengths: sequence-to-sequence learning, e.g. machine translation ("machine learning" → 機器學習). An RNN reads "machine learning"; its final state contains all the information about the input sequence.
Many to Many (No Limitation). Decoding character by character from that state can keep going: 機 器 學 習 慣 性 ... The network does not know when to stop.
Many to Many (No Limitation). (An analogy from the PTT forum, where a push comment of "=====" marks 斷, i.e. cutting a chain of posts off. Ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87 , 鄉民百科 / PTT Pedia)
Many to Many (No Limitation). Sequence-to-sequence learning, e.g. machine translation ("machine learning" → 機器學習). Add a stop symbol "===" (斷): the network outputs 機 器 學 習 and then === to end the sequence. [Ilya Sutskever, NIPS 14] [Dzmitry Bahdanau, arXiv 15]
(Thanks to 曾柏翔 for providing the experimental results.) Unfortunately, RNN-based networks are not always easy to learn. In real experiments on language modeling, the total cost sometimes jumps around wildly during training; a smoothly decreasing curve takes some luck.
The error surface is rough: it is either very flat or very steep. Clipping the gradient keeps an update from flying off when it hits a steep cliff of the cost surface. [Razvan Pascanu, ICML 13]
Why? Toy example: a 1-neuron recurrent network unrolled over 1000 time steps with recurrent weight w, all other weights 1, and input sequence 1, 0, 0, ..., 0, so y^1000 = w^999. With w = 1, y^1000 = 1; with w = 1.01, y^1000 ≈ 20000 (large gradient, so we would want a small learning rate?); with w = 0.99 or w = 0.01, y^1000 ≈ 0 (small gradient, so we would want a large learning rate?). The same parameter can thus need very different learning rates.
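The toy example can be checked directly; the numbers below illustrate why the same weight sees both exploding and vanishing behavior.

# y^1000 = w^999 for the 1-neuron, 1000-step toy RNN above
for w in (1.0, 1.01, 0.99, 0.01):
    print(w, w ** 999)
# w = 1.00 -> 1
# w = 1.01 -> about 2 * 10^4   (explodes)
# w = 0.99 -> about 4 * 10^-5  (vanishes)
# w = 0.01 -> about 0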
Helpful Techniques. Nesterov's Accelerated Gradient (NAG): an advanced momentum method. RMSProp: an advanced approach that gives each parameter a different learning rate while taking into account how the derivatives change (second-derivative-like information). Long Short-term Memory (LSTM): can deal with gradient vanishing (not gradient explosion).
Long Short-term Memory (LSTM). A special neuron with 4 inputs and 1 output: a memory cell guarded by an input gate, an output gate, and a forget gate. The signals that control the input, output, and forget gates come from other parts of the network.
LSTM. Let z be the input and z_i, z_f, z_o the signals controlling the input, forget, and output gates. The gate activation f is usually a sigmoid, so it lies between 0 and 1, mimicking an open or closed gate. The memory cell is updated as c' = g(z) f(z_i) + c f(z_f), and the output of the unit is a = h(c') f(z_o).
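A minimal NumPy sketch of one LSTM unit following the equations above; taking g and h to be tanh is an assumption, since the slide leaves them generic.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_neuron(z, z_i, z_f, z_o, c):
    f, g, h = sigmoid, np.tanh, np.tanh   # f keeps every gate signal between 0 and 1
    c_new = g(z) * f(z_i) + c * f(z_f)    # input gate scales the new input, forget gate scales the old memory
    a = h(c_new) * f(z_o)                 # output gate scales the output
    return a, c_new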
Original Network: Simply replace the neurons with LSTM a 1 a 2 z 1 z 2 x 1 x 2 Input
Replacing each neuron with an LSTM uses 4 times as many parameters, since each LSTM unit takes the four inputs z, z_i, z_f, z_o.
LSTM Extension: peephole. The signals z, z_i, z_f, z_o at time t are computed not only from the input x_t and the previous output h_(t-1) but also from the previous cell value c_(t-1).
Other Simpler Alternatives. Gated Recurrent Unit (GRU) [Cho, EMNLP 14]. Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR 15]. Vanilla RNN initialized with the identity matrix and the ReLU activation function [Quoc V. Le, arXiv 15]: outperforms or is comparable with LSTM on 4 different tasks.
What is the next wave? Attention-based Model. A DNN/LSTM maps input x to output y, while a reading-head controller and a writing-head controller read from and write to an internal memory (or information from the output). Already applied to speech recognition, caption generation, QA, and visual QA.
What is the next wave? Attention-based Model.
End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. arXiv preprint, 2015.
Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. arXiv preprint, 2014.
Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. Kumar et al. arXiv preprint, 2015.
Neural Machine Translation by Jointly Learning to Align and Translate. D. Bahdanau, K. Cho, Y. Bengio. International Conference on Learning Representations, 2015.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu et al. arXiv preprint, 2015.
Attention-Based Models for Speech Recognition. Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio. arXiv preprint, 2015.
Recurrent Models of Visual Attention. V. Mnih, N. Heess, A. Graves, K. Kavukcuoglu. NIPS, 2014.
A Neural Attention Model for Abstractive Sentence Summarization. A. M. Rush, S. Chopra, J. Weston. EMNLP, 2015.
Concluding Remarks
Concluding Remarks. Introduction of deep learning. Discussing some reasons for using deep learning. New techniques for deep learning: ReLU and Maxout; giving the parameters different learning rates; dropout. Networks with memory: recurrent neural network (RNN); long short-term memory (LSTM).
Reading Materials. "Neural Networks and Deep Learning", written by Michael Nielsen: http://neuralnetworksanddeeplearning.com/ "Deep Learning" (not finished yet), written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville: http://www.iro.umontreal.ca/~bengioy/dlbook/
Thank you for your attention!
Acknowledgement. Thanks to Ryan Sun for writing in to point out typos on the slides.
Appendix
Matrix Operation. The first layer written as a matrix operation, σ(W x + b) = a: σ( [1 -2; -1 1] [1; -1] + [1; 0] ) = [0.98; 0.12].
Why Deep? An analogy with logic circuits: two levels of basic logic gates can represent any Boolean function, but no one uses only two levels of logic gates to build computers. Using multiple layers of logic gates makes some functions much simpler to build (fewer gates are needed).
Boosting: weak classifiers applied to the input x are combined into a boosted weak classifier, which can be combined again into a boosted boosted weak classifier. Deep learning builds up its layers in a comparable fashion.
Maxout. ReLU is a special case of Maxout. A ReLU neuron computes z = wx + b from input x and outputs a = max(z, 0). The same neuron in a maxout network: one element of the group is z_1 = wx + b and the other has zero weight and zero bias, z_2 = 0, so a = max(z_1, z_2) reproduces ReLU exactly.
Maxout. ReLU is a special case of Maxout. If the second element instead has its own learnable parameters, z_2 = w'x + b', then a = max(z_1, z_2) becomes a learnable activation function.