
Deep Learning

Deep learning attracts lots of attention. I believe you have seen lots of exciting results before. [figure: Deep learning trends at Google. Source: SIGMOD / Jeff Dean]

Ups and downs of Deep Learning
1958: Perceptron (linear model)
1969: Perceptron has limitations
1980s: Multi-layer perceptron (not significantly different from DNNs today)
1986: Backpropagation (usually more than 3 hidden layers is not helpful)
1989: 1 hidden layer is "good enough", why deep?
2006: RBM initialization (breakthrough)
2009: GPU
2011: Starts to become popular in speech recognition
2012: Wins the ILSVRC image competition

Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple.

Neural Network
[figure: a single neuron computes a weighted sum z of its inputs and passes it through an activation function; connecting many neurons gives a neural network]
Different connections lead to different network structures.
Network parameters θ: all the weights and biases in the neurons.

Fully Connected Feedforward Network
[figure: for the input (1, -1), the first neuron has weights (1, -2) and bias 1, so z = 4 and its output is σ(4) = 0.98; the second neuron has weights (-1, 1) and bias 0, so z = -2 and its output is σ(-2) = 0.12]
Sigmoid function: σ(z) = 1 / (1 + e^(-z))

Fully Connected Feedforward Network
[figure: the input (1, -1) flows through three layers of neurons, giving (0.98, 0.12) after the first layer, (0.86, 0.11) after the second, and the final output (0.62, 0.83)]

Fully Connected Feedforward Network
[figure: for the input (0, 0), the same network produces the output (0.51, 0.85)]
This is a function: input vector in, output vector out.
f([1, -1]) = [0.62, 0.83], f([0, 0]) = [0.51, 0.85]
Given a network structure, we define a function set.

Fully Connected Feedforward Network
[figure: Input Layer (x1 ... xN), Hidden Layers (Layer 1 ... Layer L), Output Layer (y1 ... yM); each node is a neuron, and every layer is fully connected to the next]

Deep = Many hidden layers
AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf

Deep = Many hidden layers
Residual Net (2015): 152 layers, 3.57% error (special structure)
For comparison: AlexNet (2012) 16.4%, VGG (2014) 7.3%, GoogleNet (2014) 6.7%
[figure: Taipei 101 (101 floors) shown for scale]

Matrix Operation
The first layer of the earlier example, written as one matrix-vector multiplication plus a bias vector, followed by an element-wise sigmoid:
σ( [[1, -2], [-1, 1]] [1, -1]^T + [1, 0]^T ) = σ( [4, -2]^T ) = [0.98, 0.12]^T
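As a quick check of the matrix form above, a minimal NumPy sketch; the weights, biases, and input are the ones from the slide's example:

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid: sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])   # one row of weights per neuron
b = np.array([1.0, 0.0])      # one bias per neuron
x = np.array([1.0, -1.0])     # input vector

a = sigmoid(W @ x + b)
print(a)                      # -> [0.98201379 0.11920292], i.e. roughly [0.98, 0.12]
```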

Neural Network
[figure: input x = (x1, ..., xN), output y = (y1, ..., yM); weight matrices W^1, W^2, ..., W^L and bias vectors b^1, b^2, ..., b^L between the layers]
a^1 = σ(W^1 x + b^1)
a^2 = σ(W^2 a^1 + b^2)
...
y = σ(W^L a^(L-1) + b^L)

Neural Network
y = f(x) = σ(W^L ... σ(W^2 σ(W^1 x + b^1) + b^2) ... + b^L)
The whole network is a chain of matrix operations, so parallel computing techniques can be used to speed them up.
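A minimal NumPy sketch of that nested computation, written as a loop over layers. The layer sizes and the randomly drawn parameters below are illustrative, not the lecture's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute y = f(x) = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer = one matrix operation + element-wise sigmoid
    return a

# Illustrative 2-3-3-2 network with random parameters (not from the lecture).
rng = np.random.default_rng(0)
sizes = [2, 3, 3, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

print(forward(np.array([1.0, -1.0]), weights, biases))
```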

Output Layer
The hidden layers act as a feature extractor, replacing feature engineering; their outputs x1, ..., xK are the extracted features.
The output layer applies Softmax to these features, so it acts as a multi-class classifier producing y1, ..., yM.
[figure: Input Layer, Hidden Layers, Output Layer (= multi-class classifier)]
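Softmax turns the last layer's scores into values that are positive and sum to 1, so they can be read as per-class confidences. A minimal sketch (the input scores are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is positive and sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([3.0, 1.0, -3.0])   # illustrative pre-softmax scores
print(softmax(scores))                # -> approximately [0.88, 0.12, 0.00]
```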

Example Application
Input: a 16 x 16 = 256-dimensional vector; each dimension is one pixel (ink = 1, no ink = 0).
Output: a 10-dimensional vector; each dimension represents the confidence that the image is a particular digit, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), ..., y10 = 0.2 ("is 0"), so the image is recognized as "2".
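To make the encoding concrete, a small sketch of flattening a 16 x 16 binary image into the 256-dim input and reading the predicted digit off a 10-dim output. The y1, y2, y10 values are the ones from the slide; the blank image and the remaining zeros are illustrative filler:

```python
import numpy as np

# A 16 x 16 image with 1 where there is ink and 0 where there is not
# (here just a blank image for illustration).
image = np.zeros((16, 16))
x = image.reshape(256)          # the 256-dim input vector

# A 10-dim output like the one on the slide: y2 = 0.7 is the largest,
# so the network says the image is a "2".
y = np.array([0.1, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2])
digits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]   # y1 means "is 1", ..., y10 means "is 0"
print("predicted digit:", digits[int(np.argmax(y))])   # -> 2
```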

Example Application
Handwriting Digit Recognition: the machine is a neural network.
What is needed is a function with
Input: a 256-dim vector
Output: a 10-dim vector (y1 "is 1", y2 "is 2", ..., y10 "is 0")

Example Application
[figure: Input Layer (x1 ... xN), Hidden Layers (Layer 1 ... Layer L), Output Layer (y1 ... y10, labelled "is 1" ... "is 0")]
A function set containing the candidates for Handwriting Digit Recognition.
You need to decide the network structure so that a good function is in your function set.
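As a hedged sketch of what deciding such a structure can look like in code: the lecture does not prescribe any toolkit, and tf.keras and the hidden-layer sizes below are my own illustrative choices, exactly the kind of structural decision described above.

```python
import tensorflow as tf

# 256-dim input (16 x 16 image), two hidden layers, 10-dim softmax output.
# Choosing the number of layers and neurons defines the function set;
# training then picks one function (one set of weights) out of it.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),
    tf.keras.layers.Dense(500, activation="sigmoid"),   # hidden-layer sizes are illustrative
    tf.keras.layers.Dense(500, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),     # one output per digit
])
model.summary()
```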

FAQ
Q: How many layers? How many neurons for each layer? A: Trial and error + intuition.
Q: Can the structure be automatically determined? A: E.g. Evolutionary Artificial Neural Networks.
Q: Can we design the network structure? A: Convolutional Neural Network (CNN).

Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple.

Loss for an Example
Given a set of parameters, the network maps the input x (256-dim) through Softmax to an output y (10-dim), which is compared against the target ŷ (e.g. for the digit "1": ŷ1 = 1, ŷ2 = 0, ..., ŷ10 = 0).
Cross entropy: C(y, ŷ) = -Σ_{i=1}^{10} ŷ_i ln y_i

Total Loss
For all training data x^1, x^2, ..., x^N with targets ŷ^1, ..., ŷ^N, the total loss is
L = Σ_{n=1}^{N} C^n
where C^n is the cross entropy between the network output y^n and the target ŷ^n.
Find a function in the function set that minimizes the total loss L; equivalently, find the network parameters θ that minimize L.
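A minimal NumPy sketch of these definitions: the cross entropy C(y, ŷ) = -Σ ŷ_i ln y_i for one example, and the total loss L as the sum over the training examples. The network outputs and targets below are made-up placeholders:

```python
import numpy as np

def cross_entropy(y, y_hat):
    # C(y, y_hat) = -sum_i y_hat_i * ln(y_i); a small epsilon avoids log(0).
    return -np.sum(y_hat * np.log(y + 1e-12))

# Made-up network outputs (after softmax) and one-hot targets for three examples.
outputs = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
targets = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1]])

losses = [cross_entropy(y, y_hat) for y, y_hat in zip(outputs, targets)]
total_loss = sum(losses)          # L = sum_n C^n
print(losses, total_loss)
```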

Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple.

Gradient Descent
θ = {w1, w2, ..., b1, ...}: pick initial values, then update each parameter with its partial derivative:
Compute ∂L/∂w1: w1 0.2 → 0.15, i.e. 0.15 = 0.2 - μ ∂L/∂w1
Compute ∂L/∂w2: w2 -0.1 → 0.05, i.e. 0.05 = -0.1 - μ ∂L/∂w2
Compute ∂L/∂b1: b1 0.3 → 0.2, i.e. 0.2 = 0.3 - μ ∂L/∂b1
∇L = [∂L/∂w1, ∂L/∂w2, ..., ∂L/∂b1, ...]^T is called the gradient.

Gradient Descent
Repeat the update, recomputing the gradient at each step:
w1: 0.2 → 0.15 → 0.09 → ...
w2: -0.1 → 0.05 → 0.15 → ...
b1: 0.3 → 0.2 → 0.10 → ...
Each arrow is one step of subtracting μ ∂L/∂w (or μ ∂L/∂b) from the parameter.
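A minimal sketch of that update rule, θ ← θ - μ ∂L/∂θ. The quadratic loss below is only for illustration; in the lecture, L is the total cross-entropy over the training data:

```python
import numpy as np

def loss(theta):
    # Toy loss with its minimum at theta = [1, -2] (illustrative only).
    return np.sum((theta - np.array([1.0, -2.0])) ** 2)

def gradient(theta):
    # Gradient of the toy loss above.
    return 2.0 * (theta - np.array([1.0, -2.0]))

theta = np.array([0.2, -0.1])   # initial parameter values
mu = 0.1                        # learning rate

for step in range(100):
    theta = theta - mu * gradient(theta)   # theta <- theta - mu * dL/dtheta

print(theta, loss(theta))        # theta approaches [1, -2], the loss approaches 0
```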

Gradient Descent
This is how machines learn in deep learning; even AlphaGo uses this approach.
What people imagine vs. what it actually is... I hope you are not too disappointed :p

Backpropagation
Backpropagation: an efficient way to compute ∂L/∂w in a neural network.
Toolkit: libdnn (developed by NTU student 周伯威)
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/mlds_2015_2/lecture/dnn%20backprop.ecm.mp4/index.html
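The slide only names backpropagation; as a hedged illustration of what "efficiently computing ∂L/∂w" means, here is a minimal sketch of the chain rule for a one-hidden-layer network with a sigmoid hidden layer, a softmax output and cross-entropy loss. The sizes and values are made up, and this is not the libdnn implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0])              # input
t = np.array([0.0, 1.0, 0.0])          # one-hot target
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)

# Forward pass.
a1 = sigmoid(W1 @ x + b1)
y = softmax(W2 @ a1 + b2)
loss = -np.sum(t * np.log(y))

# Backward pass (chain rule). For softmax + cross entropy, dL/dz2 = y - t.
dz2 = y - t
dW2, db2 = np.outer(dz2, a1), dz2
da1 = W2.T @ dz2
dz1 = da1 * a1 * (1.0 - a1)            # sigmoid'(z1) = a1 * (1 - a1)
dW1, db1 = np.outer(dz1, x), dz1

# Numerical check of one entry of dW1 with a finite difference.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = -np.sum(t * np.log(softmax(W2 @ sigmoid(W1p @ x + b1) + b2)))
print(dW1[0, 0], (loss_p - loss) / eps)   # the two numbers should nearly match
```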

Concluding Remarks
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
What are the benefits of a deep architecture?

Deeper is Better?
Layers x Size | Word Error Rate (%)
1 x 2k        | 24.2
2 x 2k        | 20.4
3 x 2k        | 18.4
4 x 2k        | 17.8
5 x 2k        | 17.2
7 x 2k        | 17.1
1 x 3772      | 22.5
1 x 4634      | 22.6
1 x 16k       | 22.1
Not surprising: more parameters, better performance.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.

Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
So why a "deep" neural network and not a "fat" one? (next lecture)

Deep Learning: further resources
My course: Machine Learning and Having It Deep and Structured http://speech.ee.ntu.edu.tw/~tlkagk/courses_mlsd5_2.html
6-hour version: http://www.slideshare.net/tw_dsconf/ss-6224535
"Neural Networks and Deep Learning" written by Michael Nielsen http://neuralnetworksanddeeplearning.com/
"Deep Learning" written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville http://www.deeplearningbook.org

Acknowledgment
Thanks to Victor Chen for finding typos in the slides.