Deep Learning Hung-yi Lee 李宏毅
Deep learning attracts lots of attention. I believe you have seen lots of exciting results before.
[Chart: deep learning trends at Google. Source: SIGMOD 2016 / Jeff Dean]
Ups and downs of Deep Learning
- 1958: Perceptron (linear model)
- 1969: Perceptron has limitations
- 1980s: Multi-layer perceptron (not significantly different from DNNs today)
- 1986: Backpropagation (usually more than 3 hidden layers did not help)
- 1989: 1 hidden layer is "good enough"; why deep?
- 2006: RBM initialization
- 2009: GPU
- 2011: Starts to become popular in speech recognition
- 2012: Wins the ILSVRC image competition
- 2015.2: Image recognition surpasses human-level performance
- 2016.3: AlphaGo beats Lee Sedol
- 2016.10: Speech recognition systems as good as humans
Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple
Neural Network
[Figure: a single neuron computes a weighted sum z of its inputs and passes it through an activation; a neural network is many such neurons connected together.]
Different connections lead to different network structures.
Network parameter θ: all the weights and biases in the neurons.
Fully Connected Feedforward Network
[Figure: input (1, -1). The upper neuron has weights (1, -2) and bias 1, so z = 4 and its output is σ(4) = 0.98. The lower neuron has weights (-1, 1) and bias 0, so z = -2 and its output is σ(-2) = 0.12.]
Sigmoid function: σ(z) = 1 / (1 + e^(-z))
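As a quick check, here is a minimal Python sketch of the sigmoid (NumPy assumed; the code is illustrative, not from the slides), reproducing the two activations in the figure:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(4))   # ~0.98, the upper neuron's output
print(sigmoid(-2))  # ~0.12, the lower neuron's output
```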
Fully Connected Feedforward Network
[Figure: the forward pass continues layer by layer. Layer 1 outputs (0.98, 0.12). Layer 2, with weights (2, -1) bias 0 and weights (-2, -1) bias 0, outputs (0.86, 0.11). The output layer, with weights (3, -1) bias -2 and weights (-1, 4) bias 2, outputs (0.62, 0.83).]
Fully Connected Feedforward Network
[Figure: feeding the input (0, 0) through the same network gives outputs (0.51, 0.85).]
This is a function with an input vector and an output vector:
f([1, -1]) = [0.62, 0.83]    f([0, 0]) = [0.51, 0.85]
Given a network structure, we define a function set.
Fully Connected Feedforward Network
[Figure: input layer x1, x2, …, xN; hidden layers Layer 1 through Layer L of neurons; output layer y1, y2, …, yM.]
Deep = Many hidden layers
(Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf)
- AlexNet (2012): 8 layers, 16.4% error
- VGG (2014): 19 layers, 7.3% error
- GoogleNet (2014): 22 layers, 6.7% error
Deep = Many hidden layers
- AlexNet (2012): 16.4% error
- VGG (2014): 7.3% error
- GoogleNet (2014): 6.7% error
- Residual Net (2015): 152 layers, 3.57% error; special structure
(For scale: Taipei 101 has 101 layers.)
Ref: https://www.youtube.com/watch?v=dxb6299gpvi
Matrix Operation
The first layer of the example network as one matrix operation:
σ( [1 -2; -1 1] [1; -1] + [1; 0] ) = σ( [4; -2] ) = [0.98; 0.12]
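A minimal Python sketch of this single matrix operation (NumPy assumed), with the weights and biases read off the figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1., -2.],
              [-1., 1.]])   # one row of weights per neuron (from the figure)
b = np.array([1., 0.])      # biases (from the figure)
x = np.array([1., -1.])     # the example input

print(sigmoid(W @ x + b))   # -> approx. [0.98, 0.12]
```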
Neural Network
With weight matrices W1, W2, …, WL and bias vectors b1, b2, …, bL:
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
…
y = σ(WL a(L-1) + bL)
Neural Network
y = f(x) = σ(WL … σ(W2 σ(W1 x + b1) + b2) … + bL)
The whole network is a chain of matrix operations, so parallel computing techniques (e.g. GPUs) can be used to speed them up.
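Putting the layers together: a sketch of the whole example network as a loop of matrix operations. The layer 2 and output-layer weights below are assumptions reconstructed from the earlier figures (they reproduce the slide's outputs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights/biases reconstructed from the example figures (an assumption).
Ws = [np.array([[1., -2.], [-1., 1.]]),
      np.array([[2., -1.], [-2., -1.]]),
      np.array([[3., -1.], [-1., 4.]])]
bs = [np.array([1., 0.]),
      np.array([0., 0.]),
      np.array([-2., 2.])]

def f(x):
    """y = sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)."""
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)  # one layer = one matrix operation
    return a

print(f(np.array([1., -1.])))  # -> approx. [0.62, 0.83]
print(f(np.array([0., 0.])))   # -> approx. [0.51, 0.85]
```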
Output Layer as Multi-Class Classifier
The hidden layers act as a feature extractor, replacing hand-crafted feature engineering; the output layer, with Softmax, acts as a multi-class classifier.
[Figure: input layer x1, x2, …, xK → hidden layers → Softmax output layer y1, y2, …, yM.]
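A minimal sketch of the Softmax used at the output layer (the test values are illustrative, not from this slide):

```python
import numpy as np

def softmax(z):
    """Exponentiate, then normalize so the outputs sum to 1."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([3., 1., -3.])))  # -> approx. [0.88, 0.12, 0.00]
```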
Example Application
Input: a handwritten digit image, 16 × 16 = 256 pixels, as a 256-dim vector (ink → 1, no ink → 0).
Output: a 10-dim vector; each dimension represents the confidence of a digit, e.g. y1 = 0.1 (is "1"), y2 = 0.7 (is "2"), …, y10 = 0.2 (is "0"), so the image is "2".
Example Application
Handwriting Digit Recognition
[Figure: x1, x2, …, x256 → Neural Network → y1 (is "1"), y2 (is "2"), …, y10 (is "0").]
What is needed is a function. Input: 256-dim vector; output: 10-dim vector.
Example Application
[Figure: input layer x1, x2, …, xN; hidden layers Layer 1 through Layer L; output layer y1 (is "1"), y2 (is "2"), …, y10 (is "0").]
The network is a function set containing the candidates for handwriting digit recognition.
You need to decide the network structure so that a good function is in your function set; a hypothetical sketch follows.
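As an illustration of "deciding the network structure" in Python: the hidden-layer sizes below are assumptions, not values from the slides; every different choice fixes a different function set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical structure: 256 inputs, two hidden layers of 500 neurons
# (500 is an assumed size), 10 outputs.
sizes = [256, 500, 500, 10]

# theta = all the weights and biases, here randomly initialized;
# picking the structure fixes the function set, not the function.
Ws = [0.01 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
```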
FAQ
Q: How many layers? How many neurons for each layer?
A: Trial and error + intuition.
Q: Can the structure be automatically determined?
A: There are approaches, e.g. Evolutionary Artificial Neural Networks.
Q: Can we design the network structure?
A: Yes, e.g. Convolutional Neural Networks (CNN).
Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple
Loss for an Example
Given a set of parameters, compare the Softmax output y of the network with the target ŷ using cross entropy:
l(y, ŷ) = -Σ (i = 1 to 10) ŷi ln yi
[Figure: for a training image of "1", the target is ŷ = (1, 0, 0, …, 0).]
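A minimal sketch of this cross-entropy loss (the network output below is made up for illustration):

```python
import numpy as np

def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_i y_hat_i * ln(y_i), where y is the network's
    Softmax output and y_hat is the one-hot target."""
    return -np.sum(y_hat * np.log(y))

y_hat = np.zeros(10)
y_hat[0] = 1.0                  # target: the digit "1" (first dimension)
y = np.full(10, 0.02)
y[0] = 0.82                     # a made-up Softmax output, sums to 1
print(cross_entropy(y, y_hat))  # -> approx. 0.20
```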
Total Loss
For all N training examples, the total loss is L = Σ (n = 1 to N) l^n.
Find a function in the function set that minimizes the total loss L; equivalently, find the network parameters θ that minimize L.
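A short sketch of the total loss under the same assumptions, summing the per-example cross entropy over the whole training set:

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y_hat * np.log(y))

def total_loss(xs, y_hats, f):
    """L = sum of l^n over all N training examples; training searches
    for the parameters of f that make this as small as possible."""
    return sum(cross_entropy(f(x), y_hat) for x, y_hat in zip(xs, y_hats))
```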
Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple
Gradient Descent
θ = {w1, w2, …, b1, …}: start from initial values, then repeatedly move each parameter against its gradient (η is the learning rate).
Compute ∂L/∂w1: w1 ← 0.2 - η ∂L/∂w1 → 0.15
Compute ∂L/∂w2: w2 ← -0.1 - η ∂L/∂w2 → 0.05
Compute ∂L/∂b1: b1 ← 0.3 - η ∂L/∂b1 → 0.2
∇L = (∂L/∂w1, ∂L/∂w2, …, ∂L/∂b1, …) is called the gradient.
Repeat:
Compute ∂L/∂w1: w1 ← 0.15 - η ∂L/∂w1 → 0.09
Compute ∂L/∂w2: w2 ← 0.05 - η ∂L/∂w2 → 0.15
Compute ∂L/∂b1: b1 ← 0.2 - η ∂L/∂b1 → 0.10
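A sketch of one gradient descent update in Python; the learning rate and the gradient values below are assumptions, chosen only so the update reproduces the slide's numbers:

```python
import numpy as np

eta = 0.05  # learning rate (an assumed value)

def gradient_descent_step(theta, grad):
    """theta <- theta - eta * gradient, applied to every weight and bias."""
    return theta - eta * grad

theta = np.array([0.2, -0.1, 0.3])  # w1, w2, b1 from the slide
grad = np.array([1.0, -3.0, 2.0])   # made-up dL/dw1, dL/dw2, dL/db1
print(gradient_descent_step(theta, grad))  # -> [0.15, 0.05, 0.2]
```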
Gradient Descent
This is how machines learn in deep learning. Even AlphaGo uses this approach.
What people imagine the learning to be… versus what it actually is. I hope you are not too disappointed :p
Backpropagation
Backpropagation: an efficient way to compute ∂L/∂w in a neural network.
Toolkit: libdnn (developed by NTU student 周伯威)
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/mlds_2015_2/lecture/dnn%20backprop.ecm.mp4/index.html
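The reference above covers backpropagation in full; for intuition, here is a minimal sketch (not libdnn's implementation) of computing the gradients for a one-hidden-layer network with sigmoid and Softmax:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def backprop(x, y_hat, W1, b1, W2, b2):
    """Gradients of the cross entropy w.r.t. all weights and biases."""
    # Forward pass, caching the intermediate activations.
    a1 = sigmoid(W1 @ x + b1)
    y = softmax(W2 @ a1 + b2)
    # Backward pass: Softmax + cross entropy gives delta2 = y - y_hat.
    delta2 = y - y_hat
    dW2, db2 = np.outer(delta2, a1), delta2
    # Propagate back through the sigmoid; a1 * (1 - a1) is its derivative.
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)
    dW1, db1 = np.outer(delta1, x), delta1
    return dW1, db1, dW2, db2
```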
Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple
Acknowledgment
Thanks to Victor Chen for spotting typos on the slides.