Deep learning attracts lots of attention.

1 Deep Learning

2 Deep learning attracts a lot of attention. I believe you have already seen many exciting results. Deep learning trends at Google. Source: SIGMOD 2016/Jeff Dean

3 Ups and downs of Deep Learning
1958: Perceptron (linear model)
1969: Perceptron has limitations
1980s: Multi-layer perceptron. Not significantly different from today's DNN
1986: Backpropagation. Usually more than 3 hidden layers is not helpful
1989: 1 hidden layer is "good enough", why deep?
2006: RBM initialization (breakthrough)
2009: GPU
2011: Starts to be popular in speech recognition
2012: Wins the ILSVRC image competition

4 Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple

5 Neural Network
A neural network is built from neurons. Different connections lead to different network structures.
Network parameter θ: all the weights and biases in the neurons

6 Fully Connect Feedforward Network
Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
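A single sigmoid neuron can be sketched in a few lines of Python. This is only an illustration: the helper names `sigmoid` and `neuron`, and the weights and bias in the example call, are assumptions chosen for the sketch and are not taken from the slides.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical weights and bias, chosen only for illustration.
# Here z = 1*1 + (-2)*(-1) + 1 = 4, so the output is sigmoid(4).
output = neuron([1.0, -1.0], weights=[1.0, -2.0], bias=1.0)
print(output)
```

The weights and the bias are exactly the "network parameters" that learning has to determine.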

7 Fully Connect Feedforward Network

8 Fully Connect Feedforward Network
This is a function: the input is a vector and the output is a vector.
[Figure: f evaluated on two example input vectors]
Given a network structure, we define a function set.

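9 Fully Connect Feedforward Network
Input Layer ($x_1, x_2, \ldots, x_N$) -> Layer 1 -> Layer 2 -> ... -> Layer L -> Output ($y_1, y_2, \ldots, y_M$)
The layers between the Input Layer and the Output Layer are the Hidden Layers. Each node in a layer is a neuron.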

10 Deep = Many hidden layers
AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
Source: du/slides/winter56_le cture8.pdf

11 Deep = Many hidden layers
AlexNet (2012): 16.4% error
VGG (2014): 7.3% error
GoogleNet (2014): 6.7% error
Residual Net (2015): 152 layers, special structure, 3.57% error
For comparison, Taipei 101 has 101 layers (floors).

12 Matrix Operation
One layer written as a matrix operation: $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \sigma(W x + b)$

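13 Neural Network
Input $x = (x_1, x_2, \ldots, x_N)$, output $y = (y_1, y_2, \ldots, y_M)$, weights $W^1, W^2, \ldots, W^L$ and biases $b^1, b^2, \ldots, b^L$:
$a^1 = \sigma(W^1 x + b^1)$
$a^2 = \sigma(W^2 a^1 + b^2)$
...
$y = \sigma(W^L a^{L-1} + b^L)$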

14 Neural Network
$y = f(x) = \sigma(W^L \cdots \sigma(W^2 \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$
Using parallel computing techniques to speed up matrix operation
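The nested formula for y = f(x) is just a loop over layers. The sketch below uses plain Python lists instead of a matrix library so that it runs anywhere; the names `layer` and `forward` and the 2-2-2 parameter values are assumptions made for illustration, not values from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, a):
    """Compute sigma(W a + b) for one layer; W is a list of rows."""
    return [sigmoid(sum(w * x for w, x in zip(row, a)) + bias)
            for row, bias in zip(W, b)]

def forward(params, x):
    """Apply each (W, b) pair in order, feeding every output into the next layer."""
    a = x
    for W, b in params:
        a = layer(W, b, a)
    return a

# Hypothetical 2-2-2 network: two layers, each with a 2x2 weight matrix.
params = [
    ([[1.0, -2.0], [-1.0, 1.0]], [1.0, 0.0]),
    ([[2.0, -1.0], [-2.0, -1.0]], [0.0, 0.0]),
]
y = forward(params, [1.0, -1.0])
print(y)
```

In practice each `layer` call would be a single matrix-vector product, which is the part that parallel hardware such as a GPU speeds up.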

15 Output Layer
The Hidden Layers act as a feature extractor, replacing feature engineering: they turn the input $x_1, x_2, \ldots, x_K$ into new features.
Output Layer = Multi-class Classifier: a Softmax over those features gives $y_1, y_2, \ldots, y_M$.
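A softmax output layer can be sketched as follows. The function name `softmax` and the example scores are assumptions for illustration; subtracting the maximum score is a common numerical-stability trick that the slides do not mention and that leaves the result unchanged.

```python
import math

def softmax(z):
    """Turn raw scores z into positive values that sum to 1."""
    m = max(z)  # shift by the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores coming out of the last hidden layer.
probs = softmax([3.0, 1.0, -3.0])
print(probs)
```

Because the outputs are positive and sum to 1, each one can be read as the confidence assigned to a class.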

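16 Example Application
Input: a 16 x 16 image = 256 pixels, $x_1, x_2, \ldots, x_{256}$. Ink → 1, No ink → 0.
Output: $y_1, y_2, \ldots, y_{10}$, where $y_1$ means "is 1", $y_2$ means "is 2", ..., and $y_{10}$ means "is 0". Each dimension represents the confidence of a digit.
In the example, the image is "2".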

17 Example Application
Handwriting Digit Recognition: the machine (a neural network) looks at the image and answers "2".
What is needed is a function. Input: 256-dim vector ($x_1, \ldots, x_{256}$). Output: 10-dim vector ($y_1$: is 1, $y_2$: is 2, ..., $y_{10}$: is 0).
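The input and output encoding of the digit example can be sketched like this. The helper names `image_to_vector` and `predicted_digit`, the blank test image, and the confidence values are hypothetical; only the 16 x 16 = 256 layout, the ink/no-ink coding, and the output ordering (first dimension "is 1", tenth dimension "is 0") come from the slides.

```python
def image_to_vector(image):
    """Flatten a 16 x 16 grid into a 256-dim vector: ink -> 1, no ink -> 0."""
    return [1 if pixel else 0 for row in image for pixel in row]

def predicted_digit(y):
    """Pick the most confident dimension; y[0..8] stand for digits 1..9, y[9] for 0."""
    best = max(range(len(y)), key=lambda i: y[i])
    return (best + 1) % 10

# A blank 16 x 16 image (no ink anywhere), just to show the shape of the input.
blank = [[False] * 16 for _ in range(16)]
x = image_to_vector(blank)
print(len(x))

# Hypothetical network output in which the second dimension is the largest.
confidences = [0.1, 0.7, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(predicted_digit(confidences))
```

The network itself sits between these two helpers: it is the function from the 256-dim vector to the 10-dim vector.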

18 Example Application
Input Layer ($x_1, x_2, \ldots, x_N$) -> Hidden Layers (Layer 1, Layer 2, ...) -> Output Layer, Layer L ($y_1$: is 1, $y_2$: is 2, ..., $y_{10}$: is 0)
The network structure defines a function set containing the candidates for Handwriting Digit Recognition.
You need to decide the network structure so that a good function is in your function set.

19 FAQ
Q: How many layers? How many neurons for each layer?
Trial and Error + Intuition
Q: Can the structure be automatically determined?
E.g. Evolutionary Artificial Neural Networks
Q: Can we design the network structure?
Convolutional Neural Network (CNN)

20 Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple

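21 Loss for an Example
Given a set of parameters, the network maps the input $x_1, x_2, \ldots, x_{256}$ through a Softmax to the output $y_1, y_2, \ldots, y_{10}$.
The target $\hat{y}$ is one-hot: for an image of "1", $\hat{y}_1 = 1$ and every other $\hat{y}_i = 0$.
Cross Entropy: $C(y, \hat{y}) = -\sum_{i=1}^{10} \hat{y}_i \ln y_i$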

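22 Total Loss
For all training data $x^1, x^2, \ldots, x^N$, the network (NN) produces outputs $y^1, y^2, \ldots, y^N$, which are compared with the targets $\hat{y}^1, \hat{y}^2, \ldots, \hat{y}^N$ to give the losses $C^1, C^2, \ldots, C^N$.
Total Loss: $L = \sum_{n=1}^{N} C^n$
Find a function in the function set that minimizes the total loss L
Find the network parameters θ that minimize the total loss L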
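The per-example cross entropy and the total loss can be sketched as below. The names `cross_entropy` and `total_loss` and the two-example toy data are assumptions for illustration; skipping terms whose target is 0 is an implementation choice that avoids evaluating ln(0) and does not change the sum.

```python
import math

def cross_entropy(y, target):
    """C = -sum_i target_i * ln(y_i); terms with target_i = 0 contribute nothing."""
    return -sum(t * math.log(p) for p, t in zip(y, target) if t > 0)

def total_loss(outputs, targets):
    """L = sum over all training examples of the per-example loss."""
    return sum(cross_entropy(y, t) for y, t in zip(outputs, targets))

# Hypothetical network outputs and one-hot targets for two training examples.
outputs = [[0.5, 0.5], [0.25, 0.75]]
targets = [[1, 0], [0, 1]]
print(total_loss(outputs, targets))
```

A confident correct output drives the loss towards 0, while a confident wrong output makes it large, which is why a smaller total loss L means a better function.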

23 Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple

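24 Gradient Descent
Parameters $\theta = \{w_1, w_2, \ldots, b_1, \ldots\}$
Compute $\partial L / \partial w_1$, then update $w_1 \leftarrow w_1 - \mu \, \partial L / \partial w_1$
Compute $\partial L / \partial w_2$, then update $w_2 \leftarrow w_2 - \mu \, \partial L / \partial w_2$
Compute $\partial L / \partial b_1$, then update $b_1 \leftarrow b_1 - \mu \, \partial L / \partial b_1$
Gradient: $\nabla L = \left[ \partial L / \partial w_1, \; \partial L / \partial w_2, \; \ldots, \; \partial L / \partial b_1, \; \ldots \right]^T$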

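25 Gradient Descent
Starting from the updated θ, compute $\partial L / \partial w_1$, $\partial L / \partial w_2$ and $\partial L / \partial b_1$ again and apply the same update, $w \leftarrow w - \mu \, \partial L / \partial w$, to every parameter. This step is repeated many times.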
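The repeated update can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the quadratic loss, its hand-written gradient `grad_L`, the starting point, the learning rate `mu = 0.1` and the step count. A real network would obtain the gradient from backpropagation instead.

```python
def gradient_descent(grad, theta, mu=0.1, steps=100):
    """Repeat theta <- theta - mu * grad(theta) for a fixed number of steps."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - mu * gi for t, gi in zip(theta, g)]
    return theta

# Hypothetical loss L(w1, w2, b) = (w1 - 1)^2 + (w2 + 2)^2 + b^2,
# whose minimum is at (1, -2, 0).
def grad_L(theta):
    w1, w2, b = theta
    return [2 * (w1 - 1), 2 * (w2 + 2), 2 * b]

theta = gradient_descent(grad_L, [0.2, -0.1, 0.3])
print(theta)
```

On this convex toy loss the parameters move steadily towards the minimum; on a real network the same loop only guarantees that each small step goes downhill locally.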

26 Gradient Descent
This is the "learning" of machines in deep learning. Even AlphaGo uses this approach.
What people imagine versus what it actually is: I hope you are not too disappointed :p

27 Backpropagation
Backpropagation: an efficient way to compute $\partial L / \partial w$ in a neural network
libdnn: developed by National Taiwan University student 周伯威
Ref: ackprop.ecm.mp4/index.html
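To make the idea concrete, here is a minimal sketch of backpropagation on a tiny network with one input, one hidden neuron and one output neuron. The setup is an assumption chosen to keep the chain rule readable: it has no biases and uses a squared-error loss rather than cross entropy, and the function names and numbers are hypothetical. The finite-difference function is included only as a check on the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, target):
    """Tiny 1-1-1 network with squared error (a hypothetical setup)."""
    a = sigmoid(w1 * x)
    y = sigmoid(w2 * a)
    return 0.5 * (y - target) ** 2

def backprop(w1, w2, x, target):
    """Return (dL/dw1, dL/dw2) using one forward and one backward pass."""
    # Forward pass: keep the intermediate activations.
    a = sigmoid(w1 * x)
    y = sigmoid(w2 * a)
    # Backward pass: propagate dL/dz from the output back towards the input.
    delta2 = (y - target) * y * (1 - y)   # dL/dz2, with z2 = w2 * a
    delta1 = delta2 * w2 * a * (1 - a)    # dL/dz1, with z1 = w1 * x
    return delta1 * x, delta2 * a

def numerical(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used here only to check backprop."""
    d1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    d2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return d1, d2

g_bp = backprop(0.5, -0.3, 2.0, 1.0)
g_num = numerical(0.5, -0.3, 2.0, 1.0)
print(g_bp, g_num)
```

Backpropagation gets every partial derivative from a single backward pass, whereas the finite-difference check needs two extra loss evaluations per parameter, which is why it does not scale to real networks.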

28 Concluding Remarks
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
What are the benefits of deep architecture?

29 Deeper is Better?
[Table: Layer X Size (e.g. 5 X 2k) vs. Word Error Rate (%)]
Not surprising: more parameters, better performance.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

30 Universality Theorem
Any continuous function $f : R^N \to R^M$ can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: eplearning.com/chap4.html
Why a "Deep" neural network and not a "Fat" neural network? (next lecture)

31 Learning Deep Learning in Depth (深度學習深度學習)
My Course: "Machine learning and having it deep and structured" (a 6 hour version is also available)
"Neural Networks and Deep Learning", written by Michael Nielsen
"Deep Learning", written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville

32 Acknowledgment
Thanks to Victor Chen for spotting the typos on the slides.

Deep Learning. Hung-yi Lee 李宏毅
