Tips for Deep Learning

Size: px

Start display at page:

Download "Tips for Deep Learning"

Tracy McDowell
6 years ago
Views:

1 Tips for Deep Learning

2 Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results on Training Data? Neural Network

3 Do not always blame Overfitting Not well trained Overfitting? Training Data Testing Data Deep Residual Learning for Image Recognition

4 Recipe of Deep Learning YES Different approaches for different problems. e.g. dropout for good results on testing data Good Results on Testing Data? YES Good Results on Training Data? Neural Network

5 Recipe of Deep Learning Early Stopping Regularization Dropout New activation function YES Good Results on Testing Data? YES Good Results on Training Data? Adaptive Learning Rate

6 Hard to get the power of Deep Results on Training Data Deeper usually does not imply better.

7 Vanishing Gradient Problem x x y y x N y M Smaller gradients Learn very slow Almost random Larger gradients Learn very fast Already converge based on random!?

8 Vanishing Gradient Problem Smaller gradients x x Small output y y y y x N + w y M C + C Large input y M Intuitive way to compute the derivatives C w =? C w

9 ReLU Rectified Linear Unit (ReLU) σ z a = 0 a a = z [Xavier Glorot, AISTATS ] [Andrew L. Maas, ICML 3] [Kaiming He, arxiv 5] z Reason:. Fast to compute. Biological reason 3. Infinite sigmoid with different biases 4. Vanishing gradient problem

10 ReLU a a = z 0 a = 0 z x x 0 0 y y 0

11 ReLU a a = z A Thinner linear network a = 0 z x y x y Do not have smaller gradients

12 ReLU - variant Leaky ReLU a a = z Parametric ReLU a a = z a = 0.0z z a = αz z α also learned by gradient descent

13 Maxout ReLU is a special cases of Maxout Learnable activation function [Ian J. Goodfellow, ICML 3] Input x neuron Max Max x Max + 3 Max 4 You can have more than elements in a group.

14 Maxout ReLU is a special cases of Maxout Input x w b z ReLU a Input x 0 w 0 b + z Max a + z max z, z a z = wx + b x a z = wx + b x z =0

15 Maxout More than ReLU Input x w b z ReLU a a Input w x w b z = wx + b x b + z Max a + z a max z, z Learnable Activation Function z = wx + b x z = w x + b

16 Maxout Learnable activation function [Ian J. Goodfellow, ICML 3] Activation function in maxout network can be any piecewise linear convex function How many pieces depending on how many elements in a group elements in a group 3 elements in a group

17 Maxout - Training Given a training data x, we know which z would be the max Input x + z + z Max a max z, z + z + z Max a x + z 3 + z 3 x + z 4 Max a + z 4 Max a a a

18 Maxout - Training Given a training data x, we know which z would be the max Input x + z + z a + z + z a x + z 3 + z 3 a x + z 4 a + z 4 a Train this thin and linear network Different thin and linear network for different examples a

19 Recipe of Deep Learning Early Stopping Regularization Dropout New activation function YES Good Results on Testing Data? YES Good Results on Training Data? Adaptive Learning Rate

20 Review Smaller Learning Rate w Larger Learning Rate Adagrad w t+ w t η σt i=0 w g i g t Use first derivative to estimate second derivative

21 RMSProp Error Surface can be very complex when training NN. Smaller Learning Rate w Larger Learning Rate w

22 RMSProp w w 0 η σ 0 g0 w w η σ g σ 0 = g 0 σ = α σ 0 + α g w 3 w η σ g σ = α σ + α g w t+ w t η gt σt σ t = α σ t + α g t Root Mean Square of the gradients with previous gradients being decayed

23 Hard to find optimal network parameters Total Loss Very slow at the plateau Stuck at saddle point Stuck at local minima L w 0 L w = 0 L w = 0 The value of a network parameter w

24 In physical world Momentum How about put this phenomenon in gradient descent?

25 Review: Vanilla Gradient Descent L θ 0 θ 0 θ Gradient Movement L θ θ θ 3 L θ L θ 3 Start at position θ 0 Compute gradient at θ 0 Move to θ = θ 0 - η L θ 0 Compute gradient at θ Move to θ = θ η L θ Stop until L θ t 0

26 Momentum Movement: movement of last step minus gradient at present L θ 0 L θ θ 0 θ Gradient Movement Movement of last step θ θ 3 L θ L θ 3 Start at point θ 0 Movement v 0 =0 Compute gradient at θ 0 Movement v = λv 0 - η L θ 0 Move to θ = θ 0 + v Compute gradient at θ Movement v = λv - η L θ Move to θ = θ + v Movement not just based on gradient, but previous movement.

Momentum Movement: movement of last step minus gradient at present v i is actually the weighted sum of all the previous gradient: L θ 0, L θ, L θ i v 0 = 0 v = - η L θ 0 v = - λ η L θ 0 - η L θ Start

27 Momentum Movement: movement of last step minus gradient at present v i is actually the weighted sum of all the previous gradient: L θ 0, L θ, L θ i v 0 = 0 v = - η L θ 0 v = - λ η L θ 0 - η L θ Start at point θ 0 Movement v 0 =0 Compute gradient at θ 0 Movement v = λv 0 - η L θ 0 Move to θ = θ 0 + v Compute gradient at θ Movement v = λv - η L θ Move to θ = θ + v Movement not just based on gradient, but previous movement

28 Momentum cost Still not guarantee reaching global minima, but give some hope Movement = Negative of L w + Momentum Negative of L w Momentum Real Movement L w = 0

29 Adam RMSProp + Momentum for momentum for RMSprop

30 Recipe of Deep Learning Early Stopping Regularization Dropout New activation function YES Good Results on Testing Data? YES Good Results on Training Data? Adaptive Learning Rate

31 Early Stopping Total Loss Stop at here Validation set Testing set Epochs Training set Keras:

32 Recipe of Deep Learning Early Stopping Regularization Dropout New activation function YES Good Results on Testing Data? YES Good Results on Training Data? Adaptive Learning Rate

33 Regularization New loss function to be minimized Find a set of weight not only minimizing original cost but also close to zero L L Original loss (e.g. minimize square error, cross entropy ) Regularization term w,, w L regularization: w w (usually not consider biases)

34 Regularization New loss function to be minimized Gradient: w w w L L Update: w w w t t L t t w w w L w t w L Closer to zero L L Weight Decay w w L regularization:

35 Regularization L regularization: w w New loss function to be minimized L L Update: L w L w sgn w t t L t L w w t w sgn w w w t L t w sgn w Always delete w L w t L w

36 Regularization - Weight Decay Our brain prunes out the useless link between neurons. Doing the same thing to machine s brain improves the performance.

37 Recipe of Deep Learning Early Stopping Regularization Dropout New activation function YES Good Results on Testing Data? YES Good Results on Training Data? Adaptive Learning Rate

38 Dropout Training: Each time before updating the parameters Each neuron has p% to dropout

39 Dropout Training: Thinner! Each time before updating the parameters Each neuron has p% to dropout The structure of the network is changed. Using the new network for training For each mini-batch, we resample the dropout neurons

40 Dropout Testing: No dropout If the dropout rate at training is p%, all the weights times -p% Assume that the dropout rate is 50%. If a weight w = by training, set w = 0.5 for testing.

41 Dropout - Intuitive Reason Training Dropout ( 腳上綁重物 ) Testing No dropout ( 拿下重物後就變很強 )

42 Dropout - Intuitive Reason 我的 partner 會擺爛, 所以我要好好做 When teams up, if everyone expect the partner will do the work, nothing will be done finally. However, if you know your partner will dropout, you will do better. When testing, no one dropout actually, so obtaining good results eventually.

43 Dropout - Intuitive Reason Why the weights should multiply (-p)% (dropout rate) when testing? Training of Dropout Testing of Dropout Assume dropout rate is 50% w w w 3 w 4 z No dropout w w w 3 w 4 Weights from training z z z Weights multiply -p% z z

44 Dropout is a kind of ensemble. Ensemble Training Set Set Set Set 3 Set 4 Network Network Network 3 Network 4 Train a bunch of networks with different structures

45 Dropout is a kind of ensemble. Ensemble Testing data x Network Network Network 3 Network 4 y y y 3 y 4 average

46 Dropout is a kind of ensemble. minibatch minibatch minibatch 3 minibatch 4 Training of Dropout M neurons M possible networks Using one mini-batch to train one network Some parameters in the network are shared

47 Dropout is a kind of ensemble. Testing of Dropout testing data x All the weights multiply -p% y y y 3 average????? y

48 Testing of Dropout x x w w x x w w x x w w z=w x +w x z=w x +w x z=w x x x x x w w w w x x w w z=w x z=0 z = w x + w x

49 Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results on Training Data? Neural Network

Tips for Deep Learning

Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results