More Tips for Training Neural Network. Hung-yi Lee

Size: px

Start display at page:

Download "More Tips for Training Neural Network. Hung-yi Lee"

Ruby Cox
5 years ago
Views:

1 More Tips for Training Neural Network Hung-yi ee

2 Outline Activation Function Cost Function Data Preprocessing Training Generalization

3 Review: Training Neural Network Neural network: f ; θ : input (vector) θ: model parameters (weights, biases) Cost Function: C θ How good your model parameters θ is C θ =, y T C θ C θ = d f ; θ, y y: reference output when is input d f ; θ, y : difference between y and the output of neural network with model parameters θ Target: find θ that minimizes C θ

4 Review: Gradient descent Target: find θ that minimizing C θ C θ =, y T C θ Gradient descent Initialization: θ 0 θ t θ t η C θ t Stochastic gradient descent Pick, y from training data set T θ t θ t η C θ t C θ = C θ l w ij C θ b i l

5 Review: Backpropagation l i l ij l i l ij z w z w C C Forward Pass Backward Pass l l z a l T l l l W z l l l l b a W z y z δ C l l a j l j j i l w ij ayer l ayer l b W z z a T W z l i Error signal for backprop l z i

6 Review: Backpropagation ayer - - z z m z m W T ayer n z z z n C C y C y C y C w y n l ij z w l i l ij δ l C z l i Backward Pass l i z C y T z W l l T l z W Error signal for backprop

7 Outline Activation Function Cost Function Data Preprocessing Training Generalization

8 Problem of Sigmoid Sigmoid Function Derivative of Sigmoid Function Derivative of Sigmoid Function is always smaller than

9 Vanishing Gradient Problem Backward Pass: ayer - ayer - z z m z m W T n z z z n C C y C y C y y n For sigmoid function, σ z always smaller than Error signal is getting smaller and smaller Gradient is smaller C w l ij z w l i l ij C z l i l i

10 Vanishing Gradient Problem Input ayer ayer ayer Output y vector 3 y vector y earn very slowly Still random earn faster Already converge The weights are converged based on random!?

11 ReU PReU a a = z Rectified inear Unit (ReU) a = αz z σ z a a = z σ z a a = 0 z 0 z

12 ReU Backward Pass: ayer - - z z m z m W T ayer n z z z n C C y C y C y y n

13 ReU Backward Pass: ayer - No Attenuation - 0 m 0 W T ayer 0 n 0 C C y C y C y y n

14 ReU Backward Pass: n C y y C n y C ayer y C n n z C n nj n nj z w z w C C All the weights connected to this neuron will have zero gradient.

15 Maout ReU or PReu is just a special case of Maout Input + z + z Ma ma z, z a + z + z Ma a z = W + z 3 + z 4 Ma a + z 3 + z 4 Ma a a z z z = W a a

16 Maout ReU is special case Input w b z ReU a Input 0 w 0 b + z Ma a + z ma z, z a z = w + b a z = w + b z =0

17 Maout ReU is special case Input w b z ReU a a Input w w b z = w + b b + z Ma a + z ma z, z earnable Activation Function a z = w + b z = w + b

18 Maout - Training by Backpropagation In forward pass, we know which z would be the ma Input + z + z Ma a ma z, z + z + z Ma a + z 3 + z 3 + z 4 Ma a + z 4 Ma a a a

19 Maout - Training by Backpropagation In forward pass, we know which z would be the ma Input + z + z a + z + z a + z 3 a + z 3 + z 4 a + z 4 a In backward pass, treat it as network whose neurons having linear activation functions a

20 Outline Activation Function Cost Function Data Preprocessing Training Generalization

21 Problem of Sigmoid - z z z z C C y C y y y ŷ y ŷ C y y yˆ y n y n n ˆ arger difference, arger gradient m z m W T n z n C y n y ˆ n y n 0 arge difference z tiny derivative How about ReU? z

22 ReU is probably not suitable for output layer y = 0 0 Only one dimension is, and others are all 0 ReU More similar? y = ReU y = arger output means larger confidence Better? It is better to let the output bounded.

23 Softma Softma layer as the output layer Ordinary Output layer z y z z y z z 3 y3 z3

24 Softma Softma layer as the output layer Softma ayer Probability: > y i > 0 i y i = z z z 3 3 e z e.7-3 e z e 0 e z e 3 j 0.05 z j e 0.88 y 0. y 0 y 3 e e e z z z 3 3 j 3 j 3 j e e e z z z j j j

25 Softma Define cost function y i j e z i e z j Minimize C θ = C θ = logy t, y T C θ z z z i Softma y y y i ŷ ŷ ŷ i y = 0 0 Only one dimension is, and others are all 0 t Inde of the dimension which is

26 Softma Define cost function z z y z i i j Softma e z i e z j y y y i Minimize C θ = C θ = logy t y ŷ y y = ŷ y t ŷ i, y T y = C θ Cross Entropy 0 0 Increase y t Don t we have to decrease other dimensions?

27 y i i = t j e z i e C θ = logy t y tˆ tˆ z C z j tˆ j C y e z tˆ e tˆ z j y z tˆ tˆ yˆ t i y z y i tˆ tˆ C z i z z z i yˆ t y z t appears in both numerator and denominator tˆ Softma y tˆ y tˆ y y y i yˆ t ŷ ŷ ŷ i The absolute value of δ t is larger when y t is far from

28 y i i t y i tˆ j e z i e C θ = logy t z j C y C z j e i z tˆ tˆ e y z z j tˆ i yˆ t i y i yt ˆ zi C z i 0 z z z i yˆ t z i appears only in denominator Softma yˆ y t i y y y i yi 0 ŷ ŷ ŷ i The absolute value of δ i is larger when y i is larger

29 Outline Activation Function Cost Function Data Preprocessing Training Generalization

30 Data Preprocessing For each dimension i: mean: m i standard deviation: σ i 3 r R r i i r m i The means of all dimensions are 0, σ i and the variances are all

31 Outline Activation Function Cost Function Data Preprocessing Training Generalization

32 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization earning Rates

33 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization earning Rates

34 Problem of Gradient descent cost Very slow at the plateau Gradient is small Stuck at local minima Gradient is Zero parameter space

35 In physical world Momentum How about put this phenomenon in gradient descent?

36 Original Gradient descent C θ 0 θ 0 θ Gradient Movement C θ θ θ 3 C θ C θ 3 Start at position θ 0 Compute gradient at θ 0 Move to θ = θ 0 - μ C θ 0 Compute gradient at θ Move to θ = θ μ C θ

37 Gradient descent with Momentum C θ 0 C θ θ 0 θ Gradient Movement Momentum θ θ 3 Start at point θ 0 Momentum v 0 =0 (v i : movement at the i-th update) Compute gradient at θ C θ 0 Momentum v = λv 0 - μ C θ 0 Move to θ = θ 0 + v Compute gradient at θ C θ 3 Momentum v = λv - μ C θ Move to θ = θ + v

38 Gradient descent with Momentum v i is actually the weighted sum of all the previous gradient: C θ 0, C θ, C θ i v 0 =0 v = - μ C θ 0 v = - λ μ C θ 0 - μ C θ Start at point θ 0 Momentum v 0 =0 (v i : movement at the i-th update) Compute gradient at θ 0 Momentum v = λv 0 - μ C θ 0 Move to θ = θ 0 + v Compute gradient at θ Momentum v = λv - μ C θ Move to θ = θ + v

39 Gradient descent with cost Momentum Problem Gradient Movement Momentum Gradient = 0 parameter space

40 NAG Momentum Gradient Movement Momentum Nesterov s Accelerated Gradient (NAG) Gradient = 0 Gradient = 0

41 Gradient descent, Momentum, NAG Methods: Gradient descent Momentum NAG Source:

42 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization earning Rates

43 earning Rates cost Very arge arge Just make small epoch Error Surface

44 earning Rates Popular & Simple Idea: Reduce the learning rate by some factor every few epochs. At the beginning, we are far from a minimum, so we use larger learning rate After several epochs, we are close to a minimum, so we reduce the learning rate E.g. /t decay: η t = η (t + ) Not always true

45 Adaptive earning Rates Each parameter should have different learning rates Smaller earning Rate arger Gradient w w arger earning Rate Smaller earning Rate

46 RProp g t : gradient obtained with w t Each parameter should have different learning rates Gradient descent: w t+ w t ηg t If g t > 0 Another Aspect: - If g t < 0 The learning rate is η g t. wt+ w t η g t gt

47 RProp - Problem C w = E[ C w ] RProp is problematic for stochastic gradient descent Gradient descent: w t+ w t η C 0. w Stochastic Gradient descent: w t+ w t η C w

48 Adagrad Divide the learning rate by average gradient w t+ w t η σ gt σ: Average gradient of parameter w Estimated while updating the parameters If w has small average gradient σ If w has large average gradient σ arger learning rate Smaller learning rate

49 Adagrad w w 0 η σ 0 g0 σ 0 = g 0 w w η σ g σ = g0 + g w 3 w η σ g σ = 3 g0 + g + g w t+ w t η gt σt σ t = t t + i=0 g i

50 Adagrad Divide the learning rate by average gradient The average gradient is obtained while updating the parameters η t w t+ w t η σ t gt σ t = t t + i=0 g i η t = η t + /t decay w t+ w t η t i=0 g i g t

51 RMSProp Error Surface can be comple in deep learning. Smaller earning Rate w arger earning Rate w

52 RMSProp w w 0 η σ 0 g0 w w η σ g σ 0 = g 0 σ = α σ 0 + α g w 3 w η σ g σ = α σ + α g w t+ w t η gt σt σ t = α σ t + α g t

53 izing_gradient_optimization_techniques/cklhott (By Alec Radford)

54 izing_gradient_optimization_techniques/cklhott (By Alec Radford)

55 Not the whole story Adadelta ogletr0.pdf Adam AdaSecant No more pesky learning rates Using second order information

56 Outline Activation Function Cost Function Data Preprocessing Training Generalization

57 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout

58 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout

59 Weight Decay Our Brain

60 Weight Decay The parameters closer to zero is preferred. Training data:, yˆ, } { i w w w i z z w yˆ z Testing data:, yˆ, } { z' To minimize the effect of noise, we want w close to zero. w w w z w

61 Weight Decay New cost function to be minimized Find a set of weight not only minimizing original cost but also close to zero C C Original cost (e.g. minimize square error, cross entropy ) Regularization term: W, W, w w w w (not consider biases. why?)

62 Weight Decay New cost function to be minimized Gradient: w w w C C Update: w w w t t C t t w w w C w t w C Smaller and smaller w w w w C C

63 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout

64 Dropout θ t θ t η C θ t Training: (stochastic gradient descent) In each iteration Each neuron has p% to dropout

65 Dropout θ t θ t η C θ t Training: (stochastic gradient descent) Thinner! In each iteration Each neuron has p% to dropout The structured of the network is changed. Using the new network for training For each iteration, we resample the dropout neurons

66 Dropout Testing: No dropout If the dropout rate at training is p%, all the weights times (-p)% Assume that the dropout rate is 50%. l l If w ij = from training, set w ij = 0.5 for testing.

67 Dropout - Intuitive Reason 我的 partner 會擺爛, 所以我要好好做 When teams up, if everyone epect the partner will do the work, nothing will be done finally. However, if you know your partner will dropout, you will do better. When testing, no one dropout actually, so obtaining good results eventually.

68 Dropout - Intuitive Reason Why the weights should multiply (-p)% (dropout rate) when testing? Training of Dropout Testing of Dropout Assume dropout rate is 50% w w w 3 w 4 z No dropout w w w 3 w 4 Weights from training z z z Weights multiply (-p)% z z

69 Dropout - Ensemble Ensemble Training Set Set Set Set 3 Set 4 Network Network Network 3 Network 4 Train a bunch of networks with different structures

70 Dropout - Ensemble Ensemble Testing data Network Network Network 3 Network 4 y y y 3 y 4 average

71 Dropout - Ensemble Dropout Ensemble. Training of Dropout 3 4 M neurons M possible networks Using one data to train one network Some parameters in the network are shared

72 Dropout - Ensemble Testing of Dropout testing data Dropout Ensemble. All the weights multiply (-p)% y y y 3 average? y

73 Dropout - Ensemble Eperiments on hand writing digital classification Ref:

74 Practical Suggestion for Dropout arger network If you know your task need n neurons, for dropout rate p, your network need n/(-p) neurons. onger training time Higher learning rate arger momentum

75 Not the whole story To learn more about the theory of dropout Baldi, Pierre, and Peter J. Sadowski. "Understanding dropout." Advances in Neural Information Processing Systems. 03. DropConnect Reference: Wan,., Zeiler, M., Zhang, S., Cun, Y.., & Fergus, R. (03). Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine earning (ICM-3) Standout Reference: Ba, J., & Frey, B. (03). Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems

76 Concluding Remarks Outlook: How to connect neurons Parameters Initialization Activation Function Cost Function Data Preprocessing Training Generalization ReU, Maout Softma + Cross Entropy Mean=0, Variance= Momentum, NAG, Adagrad, RMSProp Weight decay, Dropout

77 Acknowledgement 感謝賴威昇同學糾正投影片上的錯誤

78 Appendi

79 Optimizaton method ahiriprochnowng0.pdf For RNN kever3.pdf

80 Kludge? The smallness of the weights means that the behaviour of the network won't change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. Simple is not always better If one had judged merely on the grounds of simplicity, then some modified form of Newton's theory would arguably have been more attractive. the dynamics of gradient descent learning in multilayer nets has a `self-regularization' effect"

81 Use genetic approach

82 Story of NN ents/5lnbt/ama_yann_lecun/chivdv7

83 Reference Everything about tuing parameters

84 Activation Functions a Softplus: a ln e z a ma a e z Softsign: a a e e z z e e z z z

85 RProp Each parameter has different learning rate θ i θ i η i C θ i If at t- and t-th iterations, C θ i has the same sign Otherwise: - η i : parameter dependent η i η i 0.5 η i η i. (decrease) If C θ i > 0 If C θ i < 0 (increase)

86 RProp Original Gradient descent RProp 0.0 ~ 0.3k updates updates

87 Adadelta

88 MNIST TIMIT

89 Error Surface: Square Error v.s. Cross Entropy Red: Square Error Black: Cross Entropy Cost w w Ref: Glorot, X., & Bengio, Y. (00). Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics

90 Reference Mean Variance Decorrelation

91 ReU Backward Pass: l δ ayer l l z l z ayer l + δ l z z l l ayer - - z z ayer z z C C y C y y i l z i l W T k l z k m z m W T n z n C y n

92 Start at point θ 0 Momentum v 0 =0 (Movement at the last step) Compute gradient at θ 0 Momentum v = λv 0 - μ C θ 0 Move to θ = θ 0 + v Compute gradient at θ Momentum v = λv - μ C θ First make a big jump in the direction of the previous accumulated gradient. Move to θ = θ + v Then measure the gradient where you end up and make a correction. Its better to correct a mistake after you have made it! Start at point θ 0 Momentum v 0 =0 (Movement at the last step) Compute gradient at θ 0 Momentum v = λv 0 - μ C θ 0 + λv 0 Move to θ = θ 0 + v Compute gradient at θ Momentum v = λv - μ C θ + λv Move to θ = θ + v Compute gradient at θ Momentum v 3 = λv - μ C θ + λv Move to θ 3 = θ + v 3

93 Start at point θ 0 Momentum v 0 =0 (Movement at the last step) Compute gradient at θ 0 Momentum v = λv 0 - μ C θ 0 + λv 0 Move to θ = θ 0 + v Compute gradient at θ Momentum v = λv - μ C θ + λv Move to θ = θ + v Compute gradient at θ Momentum v 3 = λv - μ C θ + λv Move to θ 3 = θ + v 3 Start at point θ 0 Momentum v 0 =0 (Movement at the last step) Momentum v =0- μ C θ 0 Move to θ = θ 0 + v Move to θ = θ + λv Momentum v = λv - μ C θ Move to θ = θ + v Momentum v 3 = λv - μ C θ + Move to θ 3 = θ + v 3

94 Why Zero Mean? w C w j j C z l i the same for all j Input w ayer The inputs are always positive j w j i w w Sigmoid <? Hyperbolic Tangent Initial: target:

Tips for Deep Learning

Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results