Based on the original slides of Hung-yi Lee



1 Based on the original slides of Hung-yi Lee

2 Google Trends: deep learning has obtained many exciting results. It can contribute to new smart services in the context of the Internet of Things (IoT). IoT services need to learn from raw data: classification and prediction.

3 Outline. Part I: Introduction of Deep Learning. Part II: Why Deep? Part III: Tips for Training Deep Neural Network. Part IV: Neural Network with Memory.

4 What people already knew in the 1980s

5 Handwriting Digit Recognition: image → Machine → "2"

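6 INPUT: $x_1, x_2, \ldots, x_{256}$ (16 x 16 = 256 pixels; ink → 1, no ink → 0). OUTPUT: $y_1, y_2, \ldots, y_{10}$, where $y_1$ is the confidence that the image is "1", $y_2$ that it is "2", ..., and $y_{10}$ that it is "0". In the example, the image is "2". Each dimension represents the confidence of a digit.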

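7 Handwriting Digit Recognition: $x_1, x_2, \ldots, x_{256}$ → Machine → $y_1, y_2, \ldots, y_{10}$ ("2"). The machine is a function $f: \mathbb{R}^{256} \to \mathbb{R}^{10}$. In deep learning, the function $f$ is represented by a neural network.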

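8 Neuron $f: \mathbb{R}^K \to \mathbb{R}$. Inputs $a_1, a_2, \ldots, a_K$ with weights $w_1, w_2, \ldots, w_K$ and bias $b$: $z = a_1 w_1 + a_2 w_2 + \cdots + a_K w_K + b$. The output is $a = \sigma(z)$, where $\sigma$ is the activation function.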

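9 Neural network: input $x_1, x_2, \ldots, x_N$ → Layer 1 → Layer 2 → ... → Layer L → output $y_1, y_2, \ldots, y_M$. Each layer is made of neurons. The network consists of an input layer, hidden layers, and an output layer. "Deep" means many hidden layers.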

10 Sigmoid function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
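
The neuron and sigmoid definitions above can be sketched in a few lines of plain Python. This is only an illustration: the input values, weights, and bias below are hypothetical numbers chosen for the example, not values taken from the slides.

```python
import math

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(a, w, b):
    """One neuron: weighted sum of the inputs plus the bias, then the activation."""
    z = sum(a_k * w_k for a_k, w_k in zip(a, w)) + b
    return sigmoid(z)

# Hypothetical inputs, weights and bias (illustrative only).
out = neuron(a=[1.0, -1.0], w=[1.0, -2.0], b=1.0)
print(round(out, 2))  # z = 1*1 + (-1)*(-2) + 1 = 4, sigmoid(4) is about 0.98
```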

12 $f: \mathbb{R}^2 \to \mathbb{R}^2$, evaluated for example at the input $(0, 0)$. Different parameters define different functions.

13 [Figure: the outputs $y_1$, $y_2$ of a layer, computed by applying $\sigma$]

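14 Input $x_1, x_2, \ldots, x_N$, output $y_1, y_2, \ldots, y_M$, with weight matrices $W^1, W^2, \ldots, W^L$ and bias vectors $b^1, b^2, \ldots, b^L$: $a^1 = \sigma(W^1 x + b^1)$, $a^2 = \sigma(W^2 a^1 + b^2)$, ..., $y = \sigma(W^L a^{L-1} + b^L)$.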

15 $y = f(x) = \sigma(W^L \cdots \sigma(W^2 \, \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$. Parallel computing techniques are used to speed up the matrix operations.
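
As a rough sketch of the layer-by-layer computation, the nested expression for $y = f(x)$ can be written as a loop over (weight matrix, bias vector) pairs. The 2-2-2 network below and all of its numbers are hypothetical, and plain Python lists stand in for the matrix routines a real toolkit would run in parallel.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, a):
    """Compute sigma(W a + b) for one layer; W is a list of rows."""
    return [sigmoid(sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i)
            for row, b_i in zip(W, b)]

def forward(params, x):
    """Apply the layers in order; params is [(W1, b1), (W2, b2), ...]."""
    a = x
    for W, b in params:
        a = layer(W, b, a)
    return a

# Hypothetical 2-2-2 network (numbers are illustrative only).
params = [
    ([[1.0, -2.0], [-1.0, 1.0]], [1.0, 0.0]),
    ([[2.0, -1.0], [-2.0, -1.0]], [0.0, 0.0]),
]
y = forward(params, [1.0, -1.0])
print([round(v, 2) for v in y])
```

Changing any entry of `params` changes the function that `forward` computes, which is the sense in which different parameters define different functions.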

16 Softmax layer as the output layer. Ordinary layer: $y_1 = \sigma(z_1)$, $y_2 = \sigma(z_2)$, $y_3 = \sigma(z_3)$. In general, the output of the network can be any value, which may not be easy to interpret.

17 Softmax layer as the output layer. Softmax layer: $y_i = \dfrac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$. The outputs behave like probabilities: $1 > y_i > 0$ and $\sum_i y_i = 1$. Example: $z = (3, 1, -3)$ gives $e^{z} \approx (20, 2.7, 0.05)$ and $y \approx (0.88, 0.12, 0)$.
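
A minimal sketch of the softmax layer, assuming three inputs $z = (3, 1, -3)$ for illustration. Subtracting the maximum before exponentiating is a common numerical-stability step added here; it is not on the slide and does not change the result.

```python
import math

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j)."""
    m = max(z)  # stability shift (an addition for this sketch); cancels in the ratio
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

y = softmax([3.0, 1.0, -3.0])
print([round(v, 2) for v in y])  # [0.88, 0.12, 0.0]
```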


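18 $\theta = \{W^1, b^1, W^2, b^2, \ldots, W^L, b^L\}$. The input is $x_1, x_2, \ldots, x_{256}$ (16 x 16 = 256 pixels; ink → 1, no ink → 0) and the output, after a softmax layer, is $y_1, y_2, \ldots, y_{10}$ ($y_1$: is "1", $y_2$: is "2", ..., $y_{10}$: is "0"). Set the network parameters $\theta$ such that for an input image of "1", $y_1$ has the maximum value, and for an input image of "2", $y_2$ has the maximum value. How do we let the neural network achieve this?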

19 Preparing training data: images and their labels (for example "5", "0", "4", "9"). Use the training data to find the network parameters.

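20 Given a set of network parameters $\theta$, each example has a cost value. For an input $x_1, x_2, \ldots, x_{256}$, the cost $L(\theta)$ compares the network output $(y_1, y_2, \ldots, y_{10})$, for example with $y_1 = 0.2$, against the target $(1, 0, \ldots, 0)$. The cost can be the Euclidean distance or the cross entropy between the network output and the target.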

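21 For all training data $x^1, x^2, x^3, \ldots, x^R$: each example $x^r$ goes through the NN to give an output $y^r$, which is compared with its target $\hat{y}^r$ to give a cost $L^r(\theta)$. Total cost: $C(\theta) = \sum_{r=1}^{R} L^r(\theta)$. It measures how bad the network parameters $\theta$ are on this task. Find the network parameters $\theta$ that minimize this value.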
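
The per-example cost and the total cost can be sketched as follows. The outputs and one-hot targets are made-up three-class values for illustration. The sketch uses the squared Euclidean distance (an assumption; the slide only says "Euclidean distance") and adds a small `eps` inside the logarithm to avoid `log(0)`.

```python
import math

def squared_distance(y, target):
    """Squared Euclidean distance between network output and target."""
    return sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, target))

def cross_entropy(y, target, eps=1e-12):
    """Cross entropy between network output and target; eps guards log(0)."""
    return -sum(t_i * math.log(y_i + eps) for y_i, t_i in zip(y, target))

def total_cost(outputs, targets, cost=cross_entropy):
    """C(theta) = sum over the R examples of the per-example cost L^r(theta)."""
    return sum(cost(y, t) for y, t in zip(outputs, targets))

# Hypothetical network outputs and one-hot targets (illustrative only).
outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]
C = total_cost(outputs, targets)
print(round(C, 3))
```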


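22 Error surface: the colors represent the value of $C$. Assume there are only two parameters $w_1$ and $w_2$ in a network, $\theta = \{w_1, w_2\}$. Randomly pick a starting point $\theta^0$. Compute the negative gradient at $\theta^0$: $-\nabla C(\theta^0)$, where $\nabla C(\theta^0) = \begin{bmatrix} \partial C(\theta^0)/\partial w_1 \\ \partial C(\theta^0)/\partial w_2 \end{bmatrix}$. Multiply by the learning rate $\eta$: $-\eta \nabla C(\theta^0)$.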

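23 Randomly pick a starting point $\theta^0$. Compute the negative gradient at $\theta^0$, $-\nabla C(\theta^0)$, and multiply it by the learning rate $\eta$ to get the step $-\eta \nabla C(\theta^0)$, so that $\theta^1 = \theta^0 - \eta \nabla C(\theta^0)$. Repeat from $\theta^1$: $\theta^2 = \theta^1 - \eta \nabla C(\theta^1)$, and so on. Eventually, we would reach a minimum.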
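
A minimal sketch of the update rule, assuming a toy two-parameter cost $C(w_1, w_2) = (w_1 - 1)^2 + 2(w_2 + 0.5)^2$ invented for this example. Its gradient is written out by hand, and the starting point is fixed rather than random so the run is repeatable.

```python
def grad_C(w1, w2):
    """Gradient of the toy cost C = (w1 - 1)^2 + 2 * (w2 + 0.5)^2."""
    return 2.0 * (w1 - 1.0), 4.0 * (w2 + 0.5)

def gradient_descent(theta0, eta=0.1, steps=100):
    """Repeat theta <- theta - eta * grad C(theta)."""
    w1, w2 = theta0
    for _ in range(steps):
        g1, g2 = grad_C(w1, w2)
        w1, w2 = w1 - eta * g1, w2 - eta * g2
    return w1, w2

# Starting point theta^0 = (3, 2) is an arbitrary choice for the sketch.
w1, w2 = gradient_descent((3.0, 2.0))
print(round(w1, 4), round(w2, 4))  # approaches the minimum at (1, -0.5)
```

This toy cost is convex, so the single minimum is always reached; a real network's error surface offers no such guarantee.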


24 Gradient descent never guarantees a global minimum. Different initial points $\theta^0$ reach different minima, and so give different results. "Who is Afraid of Non-Convex Loss Functions?" _lecun_wia/

25 Cost over the parameter space: very slow at a plateau ($\nabla C(\theta) \approx 0$), stuck at a saddle point ($\nabla C(\theta) = 0$), stuck at a local minimum ($\nabla C(\theta) = 0$).

26 Momentum: how about putting this phenomenon into gradient descent?

27 Movement = negative of gradient + momentum. Even where the gradient is 0, the momentum keeps the real movement going. This still does not guarantee reaching the global minimum, but it gives some hope.
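
One common way to write "movement = negative of gradient + momentum" is sketched below. The decay factor `lam`, the one-dimensional toy cost $C(w) = (w - 2)^2$, and the `movement0` argument are assumptions made for this illustration rather than details given on the slide.

```python
def momentum_descent(grad, theta0, movement0=0.0, eta=0.1, lam=0.9, steps=300):
    """Each step: movement = lam * previous movement - eta * gradient."""
    theta, movement = theta0, movement0
    for _ in range(steps):
        movement = lam * movement - eta * grad(theta)
        theta = theta + movement
    return theta

def grad_C(w):
    """Gradient of the toy cost C(w) = (w - 2)^2."""
    return 2.0 * (w - 2.0)

w_final = momentum_descent(grad_C, theta0=-3.0)
print(round(w_final, 3))  # close to the minimum at w = 2

# Where the gradient is 0, the previous movement still carries the point on.
w_flat = momentum_descent(lambda w: 0.0, theta0=0.0, movement0=1.0, steps=1)
print(w_flat)  # 0.9
```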

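28 Mini-batch: the training examples are split into mini-batches, for example $x^1, x^{31}, \ldots$ in one and $x^2, x^{16}, \ldots$ in another, each example with its own cost $L^r$. Randomly initialize $\theta^0$. Pick the 1st batch, $C = L^1 + L^{31} + \cdots$, and update $\theta^1 \leftarrow \theta^0 - \eta \nabla C(\theta^0)$. Pick the 2nd batch, $C = L^2 + L^{16} + \cdots$, and update $\theta^2 \leftarrow \theta^1 - \eta \nabla C(\theta^1)$. $C$ is different each time we update the parameters!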

29 Original gradient descent versus gradient descent with mini-batches: the mini-batch trajectory is unstable. The colors represent the total $C$ on all training data.

30 Mini-batch: faster and better! Randomly initialize $\theta^0$. Pick the 1st batch, $C = C^1 + C^{31} + \cdots$, and update $\theta^1 \leftarrow \theta^0 - \eta \nabla C(\theta^0)$. Pick the 2nd batch, $C = C^2 + C^{16} + \cdots$, and update $\theta^2 \leftarrow \theta^1 - \eta \nabla C(\theta^1)$. Continue until all mini-batches have been picked: that is one epoch. Repeat the above process.
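
The mini-batch procedure can be sketched on a deliberately tiny problem. Everything specific here is an assumption for illustration: a one-parameter model $y = w x$, a squared-error cost per example, six noise-free training pairs generated with $w = 3$, and a batch size of 2.

```python
import random

def minibatch_sgd(data, w0=0.0, eta=0.01, batch_size=2, epochs=50, seed=0):
    """Mini-batch gradient descent for the model y = w * x."""
    rng = random.Random(seed)
    data = list(data)
    w = w0
    for _ in range(epochs):
        rng.shuffle(data)  # regroup the examples into new mini-batches
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # C for this batch is the sum of (w*x - y)^2 over its examples,
            # so C (and its gradient) differs from one update to the next.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch)
            w -= eta * grad
        # all mini-batches have been picked: one epoch is done
    return w

# Hypothetical training pairs generated from y = 3x.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
w = minibatch_sgd(data)
print(round(w, 4))  # close to 3
```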

31 A network can have millions of parameters. Backpropagation is the way to compute the gradients efficiently. Ref: edu.tw/~tlkagk/courses/mlds_205_ 2/Lecture/DNN%20backprop.ecm.mp4/index.html. Many toolkits can compute the gradients automatically. Ref: ture/theano%20dnn.ecm.mp4/index.html
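
To make the idea concrete, here is a small sketch of backpropagation on a hypothetical network with one hidden neuron and one output neuron, using a squared-error cost. The chain rule is applied from the output backwards, and the result is checked against a slow finite-difference estimate. All parameter values are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(params, x, target):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)   # hidden neuron
    y = sigmoid(w2 * h + b2)   # output neuron
    return (y - target) ** 2

def backprop(params, x, target):
    """Gradient of the cost with respect to (w1, b1, w2, b2) by the chain rule."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = sigmoid(w2 * h + b2)
    dC_dz2 = 2.0 * (y - target) * y * (1.0 - y)   # output neuron's pre-activation
    dC_dz1 = dC_dz2 * w2 * h * (1.0 - h)          # passed back to the hidden neuron
    return [dC_dz1 * x, dC_dz1, dC_dz2 * h, dC_dz2]

def numeric_gradient(params, x, target, eps=1e-6):
    """Central finite differences: two cost evaluations per parameter."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((cost(up, x, target) - cost(down, x, target)) / (2 * eps))
    return grads

params = [0.5, -0.3, 1.2, 0.1]  # hypothetical w1, b1, w2, b2
analytic = backprop(params, x=0.8, target=1.0)
numeric = numeric_gradient(params, x=0.8, target=1.0)
print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```

The finite-difference check needs two forward passes per parameter, which is why it does not scale to millions of parameters; backpropagation reuses intermediate values and gets every gradient from one forward and one backward pass.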

33 [Table: Layer X Size versus Word Error Rate (%), for networks of increasing depth with 2k units per layer.] Not surprisingly, more parameters give better performance. Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.

34 Any continuous function $f: \mathbb{R}^N \to \mathbb{R}^M$ can be realized by a network with one hidden layer (given enough hidden neurons). Reference for the reason: eplearning.com/chap4.html. Why a "deep" neural network and not a "fat" neural network?

35 The same number of parameters: which one is better? Shallow (inputs $x_1, x_2, \ldots, x_N$ feeding one wide hidden layer) or deep (the same inputs feeding many narrower layers)?

36 [Table: Layer X Size versus Word Error Rate (%), comparing deeper networks with 2k units per layer against shallow networks with wider layers.] Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.

37 Deep → Modularization. Image → basic classifiers ("Boy or girl?", "Long or short hair?"), which are shared by the following classifiers as modules: Classifier 1 (girls with long hair), Classifier 2 (boys with long hair; little data), Classifier 3 (girls with short hair), Classifier 4 (boys with short hair). Because they build on the shared basic classifiers, the following classifiers work fine and can be trained with little data.

38 Deep → Modularization. The modularization is automatically learned from data. Starting from the inputs $x_1, x_2, \ldots, x_N$: the 1st layer holds the most basic classifiers; the 2nd layer uses the 1st layer as modules to build classifiers; the next layer uses the 2nd layer as modules; and so on. Less training data? Deep learning also works on small data sets like TIMIT.

39 SVM: input $x$ → hand-crafted kernel function → apply a simple classifier. Deep learning: inputs $x_1, x_2, \ldots, x_N$ → learnable kernel $\varphi(x)$ → simple classifier → outputs $y_1, y_2, \ldots, y_M$. Source of image: http://www.gipsa-lab.

40 Two levels of basic logic gates can represent any Boolean function. However, no one uses two levels of logic gates to build computers. Using multiple layers of logic gates makes some functions much simpler to build (fewer gates needed).
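
Parity is a standard example of this gap (the choice of parity is an illustration added here, not something the slide names). In the canonical two-level "OR of ANDs" form, n-bit parity needs one AND term for every input pattern with an odd number of ones, while a chain of two-input XOR gates needs only n - 1 gates. The sketch counts both.

```python
from functools import reduce
from itertools import product

def parity(bits):
    return sum(bits) % 2

def two_level_terms(n):
    """AND terms in the canonical OR-of-ANDs form of n-bit parity."""
    return sum(1 for bits in product([0, 1], repeat=n) if parity(bits) == 1)

def xor_chain(bits):
    """Multi-layer version: fold the bits through two-input XOR gates."""
    return reduce(lambda acc, bit: acc ^ bit, bits)

def xor_chain_gates(n):
    return n - 1

for n in (2, 4, 8):
    print(n, two_level_terms(n), xor_chain_gates(n))
# 2 2 1
# 4 8 3
# 8 128 7
```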

41 Boosting: input $x$ → several weak classifiers → combine. Deep learning: inputs $x_1, x_2, \ldots, x_N$ → weak classifiers → boosted weak classifiers → boosted boosted weak classifiers.

42 Before 2006, deeper usually did not imply better.

45 Don't forget! Modify the network. Better optimization strategy. Overfitting: preventing overfitting.

46 Modify the network: new activation functions, for example ReLU or Maxout. Better optimization strategy: adaptive learning rates. Prevent overfitting: dropout. Only use this approach when you have already obtained good results on the training data.
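
Two of the techniques named above can be sketched briefly. ReLU is simply max(0, z). For dropout, the sketch uses the "inverted" variant, in which kept activations are scaled by 1 / (1 - p_drop) during training; that scaling convention is an assumption made here, since the slide only names the technique. The input values are hypothetical.

```python
import random

def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)

def dropout(activations, p_drop, rng):
    """Training-time dropout: zero each activation with probability p_drop,
    and scale the survivors by 1 / (1 - p_drop) (inverted-dropout convention)."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [relu(z) for z in [-1.0, 0.5, 2.0, -0.3]]
dropped = dropout(hidden, p_drop=0.5, rng=rng)
print(hidden)  # [0.0, 0.5, 2.0, 0.0]
print(dropped)
```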

47 "Neural Networks and Deep Learning", written by Michael Nielsen. "Deep Learning" (not finished yet), written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville.
