Neural Networks: Optimization & Regularization


1 Neural Networks: Optimization & Regularization Shan-Hung Wu Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 1 / 60

2 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 2 / 60

3 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 3 / 60

4 Challenges NN a complex function: $\hat{y} = f(x;\Theta) = f^{(L)}(\cdots f^{(1)}(x; W^{(1)}) \cdots; W^{(L)})$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 4 / 60

5 Challenges NN a complex function: $\hat{y} = f(x;\Theta) = f^{(L)}(\cdots f^{(1)}(x; W^{(1)}) \cdots; W^{(L)})$ Given a training set $\mathbb{X}$, our goal is to solve: $\arg\min_{\Theta} C(\Theta) = \arg\min_{\Theta} -\log P(\mathbb{X} \mid \Theta) = \arg\min_{\Theta} -\sum_i \log P(y^{(i)} \mid x^{(i)}, \Theta) = \arg\min_{\Theta} \sum_i C^{(i)}(\Theta) = \arg\min_{W^{(1)}, \ldots, W^{(L)}} \sum_i C^{(i)}(W^{(1)}, \ldots, W^{(L)})$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 4 / 60

6 Challenges NN a complex function: $\hat{y} = f(x;\Theta) = f^{(L)}(\cdots f^{(1)}(x; W^{(1)}) \cdots; W^{(L)})$ Given a training set $\mathbb{X}$, our goal is to solve: $\arg\min_{\Theta} C(\Theta) = \arg\min_{\Theta} -\log P(\mathbb{X} \mid \Theta) = \arg\min_{\Theta} -\sum_i \log P(y^{(i)} \mid x^{(i)}, \Theta) = \arg\min_{\Theta} \sum_i C^{(i)}(\Theta) = \arg\min_{W^{(1)}, \ldots, W^{(L)}} \sum_i C^{(i)}(W^{(1)}, \ldots, W^{(L)})$ What are the challenges of solving this problem with SGD? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 4 / 60

7 Non-Convexity The loss function C (i) is non-convex Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 5 / 60

8 Non-Convexity The loss function C (i) is non-convex SGD stops at local minima or saddle points Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 5 / 60

9 Non-Convexity The loss function $C^{(i)}$ is non-convex SGD stops at local minima or saddle points Prior to the success of SGD (in roughly 2012), NN cost function surfaces were generally believed to have many non-convex structures Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 5 / 60

10 Non-Convexity The loss function $C^{(i)}$ is non-convex SGD stops at local minima or saddle points Prior to the success of SGD (in roughly 2012), NN cost function surfaces were generally believed to have many non-convex structures However, studies [2, 4] show that SGD seldom encounters critical points when training a large NN Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 5 / 60

11 Ill-Conditioning The loss $C^{(i)}$ may be ill-conditioned (in terms of $\Theta$) Due to, e.g., the dependency between the $W^{(k)}$'s at different layers Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 6 / 60

12 Ill-Conditioning The loss $C^{(i)}$ may be ill-conditioned (in terms of $\Theta$) Due to, e.g., the dependency between the $W^{(k)}$'s at different layers SGD has slow progress at valleys or plateaus Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 6 / 60

13 Lacks Global Minima The loss C (i) may lack a global minimum point Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 7 / 60

14 Lacks Global Minima The loss $C^{(i)}$ may lack a global minimum point E.g., for multiclass classification, $P(y \mid x, \Theta)$ provided by a softmax function Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 7 / 60

15 Lacks Global Minima The loss $C^{(i)}$ may lack a global minimum point E.g., for multiclass classification, $P(y \mid x, \Theta)$ provided by a softmax function $C^{(i)}(\Theta) = -\log P(y^{(i)} \mid x^{(i)}, \Theta)$ can become arbitrarily close to zero (if classifying example i correctly) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 7 / 60

16 Lacks Global Minima The loss $C^{(i)}$ may lack a global minimum point E.g., for multiclass classification, $P(y \mid x, \Theta)$ provided by a softmax function $C^{(i)}(\Theta) = -\log P(y^{(i)} \mid x^{(i)}, \Theta)$ can become arbitrarily close to zero (if classifying example i correctly) But not actually reaching zero Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 7 / 60

17 Lacks Global Minima The loss $C^{(i)}$ may lack a global minimum point E.g., for multiclass classification, $P(y \mid x, \Theta)$ provided by a softmax function $C^{(i)}(\Theta) = -\log P(y^{(i)} \mid x^{(i)}, \Theta)$ can become arbitrarily close to zero (if classifying example i correctly) But not actually reaching zero SGD may proceed along a direction forever Initialization is important Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 7 / 60

18 Training 101 Before training a feedforward NN, remember to standardize (z-normalize) the input Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 8 / 60

19 Training 101 Before training a feedforward NN, remember to standardize (z-normalize) the input Prevents dominating features Improves conditioning Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 8 / 60

20 Training 101 Before training a feedforward NN, remember to standardize (z-normalize) the input Prevents dominating features Improves conditioning When training, remember to: 1 Initialize all weights to small random values Breaks symmetry between different units so they are not updated in the same way Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 8 / 60

21 Training 101 Before training a feedforward NN, remember to standardize (z-normalize) the input Prevents dominating features Improves conditioning When training, remember to: 1 Initialize all weights to small random values Breaks symmetry between different units so they are not updated in the same way Biases b (k) s may be initialized to zero Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 8 / 60

22 Training 101 Before training a feedforward NN, remember to standardize (z-normalize) the input Prevents dominating features Improves conditioning When training, remember to: 1 Initialize all weights to small random values Breaks symmetry between different units so they are not updated in the same way Biases b (k) s may be initialized to zero (or to small positive values for ReLUs to prevent too much saturation) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 8 / 60

23 Training 101 Before training a feedforward NN, remember to standardize (z-normalize) the input Prevents dominating features Improves conditioning When training, remember to: 1 Initialize all weights to small random values Breaks symmetry between different units so they are not updated in the same way Biases b (k) s may be initialized to zero (or to small positive values for ReLUs to prevent too much saturation) 2 Early stop if the validation error does not continue decreasing Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 8 / 60

24 Training 101 Before training a feedforward NN, remember to standardize (z-normalize) the input Prevents dominating features Improves conditioning When training, remember to: 1 Initialize all weights to small random values Breaks symmetry between different units so they are not updated in the same way Biases b (k) s may be initialized to zero (or to small positive values for ReLUs to prevent too much saturation) 2 Early stop if the validation error does not continue decreasing Prevents overfitting Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 8 / 60
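
The following is a minimal NumPy sketch (not from the slides) of the z-normalization step described above; the function name standardize and the eps constant are illustrative choices. The same training-set statistics should be reused at test time.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Z-normalize each feature of X (rows = examples, columns = features)."""
    mu = X.mean(axis=0)               # per-feature mean
    sigma = X.std(axis=0)             # per-feature standard deviation
    X_std = (X - mu) / (sigma + eps)  # eps guards against constant features
    return X_std, mu, sigma

# Reuse the training statistics on new data:
# X_test_std = (X_test - mu) / (sigma + eps)
```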

25 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 9 / 60

26 Momentum Update rule in SGD: $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$, where $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 10 / 60

27 Momentum Update rule in SGD: $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$, where $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ Gets stuck in local minima or saddle points Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 10 / 60

28 Momentum Update rule in SGD: $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$, where $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ Gets stuck in local minima or saddle points Momentum: make the same movement $v^{(t)}$ as in the last iteration, corrected by the negative gradient: $v^{(t+1)} \leftarrow \lambda v^{(t)} - (1-\lambda) g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t+1)}$ $v^{(t)}$ is a moving average of $-g^{(t)}$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 10 / 60

29 Nesterov Momentum Make the same movement $v^{(t)}$ as in the last iteration, corrected by the look-ahead negative gradient: $\tilde{\Theta}^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t)}$, $v^{(t+1)} \leftarrow \lambda v^{(t)} - (1-\lambda) \nabla_\Theta C(\tilde{\Theta}^{(t+1)})$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t+1)}$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 11 / 60

30 Nesterov Momentum Make the same movement $v^{(t)}$ as in the last iteration, corrected by the look-ahead negative gradient: $\tilde{\Theta}^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t)}$, $v^{(t+1)} \leftarrow \lambda v^{(t)} - (1-\lambda) \nabla_\Theta C(\tilde{\Theta}^{(t+1)})$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t+1)}$ Faster convergence to a minimum Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 11 / 60

31 Nesterov Momentum Make the same movement $v^{(t)}$ as in the last iteration, corrected by the look-ahead negative gradient: $\tilde{\Theta}^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t)}$, $v^{(t+1)} \leftarrow \lambda v^{(t)} - (1-\lambda) \nabla_\Theta C(\tilde{\Theta}^{(t+1)})$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \eta v^{(t+1)}$ Faster convergence to a minimum Not helpful for NNs that lack minima Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 11 / 60
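
Below is a small NumPy sketch of the two update rules above, following the slides' convention $v \leftarrow \lambda v - (1-\lambda)g$ and $\Theta \leftarrow \Theta + \eta v$; the names grad_fn, eta, and lam are illustrative, and the toy quadratic cost at the end is only for demonstration.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, eta=0.01, lam=0.9):
    """Momentum: correct the last movement v by the current negative gradient."""
    g = grad_fn(theta)
    v = lam * v - (1.0 - lam) * g
    return theta + eta * v, v

def nesterov_step(theta, v, grad_fn, eta=0.01, lam=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    g = grad_fn(theta + eta * v)          # look-ahead gradient
    v = lam * v - (1.0 - lam) * g
    return theta + eta * v, v

# Toy usage on C(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
grad_fn = lambda th: th
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_fn)
```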

32 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 12 / 60

33 Where Does SGD Spend Its Training Time? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 13 / 60

34 Where Does SGD Spend Its Training Time? 1 Detouring a saddle point of high cost Better initialization 2 Traversing the relatively flat valley Adaptive learning rate Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 13 / 60

35 SGD with Adaptive Learning Rates Smaller learning rate $\eta$ along a steep direction Prevents overshooting Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 14 / 60

36 SGD with Adaptive Learning Rates Smaller learning rate $\eta$ along a steep direction Prevents overshooting Larger learning rate $\eta$ along a flat direction Speeds up convergence Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 14 / 60

37 SGD with Adaptive Learning Rates Smaller learning rate $\eta$ along a steep direction Prevents overshooting Larger learning rate $\eta$ along a flat direction Speeds up convergence How? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 14 / 60

38 AdaGrad Update rule: $r^{(t+1)} \leftarrow r^{(t)} + g^{(t)} \odot g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 15 / 60

39 AdaGrad Update rule: $r^{(t+1)} \leftarrow r^{(t)} + g^{(t)} \odot g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$ $r^{(t+1)}$ accumulates squared gradients along each axis Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 15 / 60

40 AdaGrad Update rule: $r^{(t+1)} \leftarrow r^{(t)} + g^{(t)} \odot g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$ $r^{(t+1)}$ accumulates squared gradients along each axis Division and square root are applied to $r^{(t+1)}$ elementwise We have $\frac{\eta}{\sqrt{r^{(t+1)}}} = \frac{\eta}{\sqrt{t+1}} \cdot \frac{1}{\sqrt{\frac{1}{t+1}\sum_{i=0}^{t} g^{(i)} \odot g^{(i)}}}$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 15 / 60

41 AdaGrad Update rule: $r^{(t+1)} \leftarrow r^{(t)} + g^{(t)} \odot g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$ $r^{(t+1)}$ accumulates squared gradients along each axis Division and square root are applied to $r^{(t+1)}$ elementwise We have $\frac{\eta}{\sqrt{r^{(t+1)}}} = \frac{\eta}{\sqrt{t+1}} \cdot \frac{1}{\sqrt{\frac{1}{t+1}\sum_{i=0}^{t} g^{(i)} \odot g^{(i)}}}$ 1 Smaller learning rate along all directions as t grows Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 15 / 60

42 AdaGrad Update rule: $r^{(t+1)} \leftarrow r^{(t)} + g^{(t)} \odot g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$ $r^{(t+1)}$ accumulates squared gradients along each axis Division and square root are applied to $r^{(t+1)}$ elementwise We have $\frac{\eta}{\sqrt{r^{(t+1)}}} = \frac{\eta}{\sqrt{t+1}} \cdot \frac{1}{\sqrt{\frac{1}{t+1}\sum_{i=0}^{t} g^{(i)} \odot g^{(i)}}}$ 1 Smaller learning rate along all directions as t grows 2 Larger learning rate along more gently sloped directions Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 15 / 60
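
A minimal NumPy sketch of the AdaGrad rule above; eps is a small stability constant commonly added in practice but not shown on the slide.

```python
import numpy as np

def adagrad_step(theta, r, g, eta=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients, then scale the step elementwise."""
    r = r + g * g                                  # r^(t+1) = r^(t) + g (.) g
    theta = theta - eta / (np.sqrt(r) + eps) * g   # per-axis learning rate
    return theta, r
```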

43 Limitations The optimal learning rate along a direction may change over time Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 16 / 60

44 Limitations The optimal learning rate along a direction may change over time In AdaGrad, r (t+1) accumulates squared gradients from the beginning of training Results in premature adaptivity Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 16 / 60

45 RMSProp RMSProp changes the gradient accumulation in $r^{(t+1)}$ into a moving average: $r^{(t+1)} \leftarrow \lambda r^{(t)} + (1-\lambda) g^{(t)} \odot g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 17 / 60

46 RMSProp RMSProp changes the gradient accumulation in $r^{(t+1)}$ into a moving average: $r^{(t+1)} \leftarrow \lambda r^{(t)} + (1-\lambda) g^{(t)} \odot g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \frac{\eta}{\sqrt{r^{(t+1)}}} \odot g^{(t)}$ A popular algorithm, Adam (short for adaptive moments) [7], is a combination of RMSProp and Momentum: $v^{(t+1)} \leftarrow \lambda_1 v^{(t)} - (1-\lambda_1) g^{(t)}$, $r^{(t+1)} \leftarrow \lambda_2 r^{(t)} + (1-\lambda_2) g^{(t)} \odot g^{(t)}$, $\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \frac{\eta}{\sqrt{r^{(t+1)}}} \odot v^{(t+1)}$ With some bias corrections for $v^{(t+1)}$ and $r^{(t+1)}$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 17 / 60
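
A NumPy sketch of RMSProp and Adam. Note one sign-convention difference from the slide: the slide accumulates the negative gradient into v and adds $\eta v/\sqrt{r}$, while the sketch accumulates the gradient itself and subtracts, which is equivalent. The bias corrections use a step counter t starting at 0, and eps is a stability constant not shown on the slide.

```python
import numpy as np

def rmsprop_step(theta, r, g, eta=0.001, lam=0.9, eps=1e-8):
    """RMSProp: moving average of squared gradients."""
    r = lam * r + (1.0 - lam) * g * g
    return theta - eta / (np.sqrt(r) + eps) * g, r

def adam_step(theta, v, r, g, t, eta=0.001, lam1=0.9, lam2=0.999, eps=1e-8):
    """Adam: RMSProp + momentum, with bias corrections for v and r."""
    v = lam1 * v + (1.0 - lam1) * g        # first moment (momentum)
    r = lam2 * r + (1.0 - lam2) * g * g    # second moment (RMSProp)
    v_hat = v / (1.0 - lam1 ** (t + 1))    # bias corrections
    r_hat = r / (1.0 - lam2 ** (t + 1))
    return theta - eta * v_hat / (np.sqrt(r_hat) + eps), v, r
```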

47 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 18 / 60

48 Training Deep NNs I So far, we modify the optimization algorithm to better train the model Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 19 / 60

49 Training Deep NNs I So far, we modify the optimization algorithm to better train the model Can we modify the model to ease the optimization task? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 19 / 60

50 Training Deep NNs I So far, we modify the optimization algorithm to better train the model Can we modify the model to ease the optimization task? What are the difficulties in training a deep NN? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 19 / 60

51 Training Deep NNs II The cost C(Q) of a deep NN is usually ill-conditioned due to the dependency between W (k) s at different layers Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 20 / 60

52 Training Deep NNs II The cost $C(\Theta)$ of a deep NN is usually ill-conditioned due to the dependency between the $W^{(k)}$'s at different layers As a simple example, consider a deep NN for $x, y \in \mathbb{R}$: $\hat{y} = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$ Single unit at each layer Linear activation function and no bias in each unit Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 20 / 60

53 Training Deep NNs II The cost $C(\Theta)$ of a deep NN is usually ill-conditioned due to the dependency between the $W^{(k)}$'s at different layers As a simple example, consider a deep NN for $x, y \in \mathbb{R}$: $\hat{y} = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$ Single unit at each layer Linear activation function and no bias in each unit The output ŷ is a linear function of x, but not of the weights Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 20 / 60

54 Training Deep NNs II The cost $C(\Theta)$ of a deep NN is usually ill-conditioned due to the dependency between the $W^{(k)}$'s at different layers As a simple example, consider a deep NN for $x, y \in \mathbb{R}$: $\hat{y} = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$ Single unit at each layer Linear activation function and no bias in each unit The output ŷ is a linear function of x, but not of the weights The curvature of f with respect to any two weights $w^{(i)}$ and $w^{(j)}$ is $\frac{\partial^2 f}{\partial w^{(i)} \partial w^{(j)}} = x \prod_{k \neq i,j} w^{(k)}$ Very small if L is large and $w^{(k)} < 1$ for $k \neq i,j$ Very large if L is large and $w^{(k)} > 1$ for $k \neq i,j$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 20 / 60

55 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

56 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Let $\Theta = [w^{(1)}, w^{(2)}, \ldots, w^{(L)}]^\top$ and $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$ based on the first-order Taylor approximation of C Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

57 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Let $\Theta = [w^{(1)}, w^{(2)}, \ldots, w^{(L)}]^\top$ and $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$ based on the first-order Taylor approximation of C The gradient $g_i^{(t)} = \frac{\partial C}{\partial w^{(i)}}(\Theta^{(t)})$ is calculated individually by fixing $\Theta^{(t)}$ in the other dimensions (the $w^{(j)}$'s, $j \neq i$) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

58 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Let $\Theta = [w^{(1)}, w^{(2)}, \ldots, w^{(L)}]^\top$ and $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$ based on the first-order Taylor approximation of C The gradient $g_i^{(t)} = \frac{\partial C}{\partial w^{(i)}}(\Theta^{(t)})$ is calculated individually by fixing $\Theta^{(t)}$ in the other dimensions (the $w^{(j)}$'s, $j \neq i$) However, $g^{(t)}$ updates $\Theta^{(t)}$ in all dimensions simultaneously in the same iteration Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

59 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Let $\Theta = [w^{(1)}, w^{(2)}, \ldots, w^{(L)}]^\top$ and $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$ based on the first-order Taylor approximation of C The gradient $g_i^{(t)} = \frac{\partial C}{\partial w^{(i)}}(\Theta^{(t)})$ is calculated individually by fixing $\Theta^{(t)}$ in the other dimensions (the $w^{(j)}$'s, $j \neq i$) However, $g^{(t)}$ updates $\Theta^{(t)}$ in all dimensions simultaneously in the same iteration $C(\Theta^{(t+1)})$ will be guaranteed to decrease only if C is linear at $\Theta^{(t)}$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

60 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Let $\Theta = [w^{(1)}, w^{(2)}, \ldots, w^{(L)}]^\top$ and $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$ based on the first-order Taylor approximation of C The gradient $g_i^{(t)} = \frac{\partial C}{\partial w^{(i)}}(\Theta^{(t)})$ is calculated individually by fixing $\Theta^{(t)}$ in the other dimensions (the $w^{(j)}$'s, $j \neq i$) However, $g^{(t)}$ updates $\Theta^{(t)}$ in all dimensions simultaneously in the same iteration $C(\Theta^{(t+1)})$ will be guaranteed to decrease only if C is linear at $\Theta^{(t)}$ Wrong assumption: $\Theta_i^{(t+1)}$ will decrease C even if the other $\Theta_j^{(t+1)}$'s are updated simultaneously Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

61 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Let $\Theta = [w^{(1)}, w^{(2)}, \ldots, w^{(L)}]^\top$ and $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$ based on the first-order Taylor approximation of C The gradient $g_i^{(t)} = \frac{\partial C}{\partial w^{(i)}}(\Theta^{(t)})$ is calculated individually by fixing $\Theta^{(t)}$ in the other dimensions (the $w^{(j)}$'s, $j \neq i$) However, $g^{(t)}$ updates $\Theta^{(t)}$ in all dimensions simultaneously in the same iteration $C(\Theta^{(t+1)})$ will be guaranteed to decrease only if C is linear at $\Theta^{(t)}$ Wrong assumption: $\Theta_i^{(t+1)}$ will decrease C even if the other $\Theta_j^{(t+1)}$'s are updated simultaneously Second-order methods? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

62 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Let $\Theta = [w^{(1)}, w^{(2)}, \ldots, w^{(L)}]^\top$ and $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$ based on the first-order Taylor approximation of C The gradient $g_i^{(t)} = \frac{\partial C}{\partial w^{(i)}}(\Theta^{(t)})$ is calculated individually by fixing $\Theta^{(t)}$ in the other dimensions (the $w^{(j)}$'s, $j \neq i$) However, $g^{(t)}$ updates $\Theta^{(t)}$ in all dimensions simultaneously in the same iteration $C(\Theta^{(t+1)})$ will be guaranteed to decrease only if C is linear at $\Theta^{(t)}$ Wrong assumption: $\Theta_i^{(t+1)}$ will decrease C even if the other $\Theta_j^{(t+1)}$'s are updated simultaneously Second-order methods? Time consuming Does not take into account high-order effects Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

63 Training Deep NNs III The ill-conditioned $C(\Theta)$ makes a gradient-based optimization algorithm (e.g., SGD) inefficient Let $\Theta = [w^{(1)}, w^{(2)}, \ldots, w^{(L)}]^\top$ and $g^{(t)} = \nabla_\Theta C(\Theta^{(t)})$ In gradient descent, we get $\Theta^{(t+1)}$ by $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta g^{(t)}$ based on the first-order Taylor approximation of C The gradient $g_i^{(t)} = \frac{\partial C}{\partial w^{(i)}}(\Theta^{(t)})$ is calculated individually by fixing $\Theta^{(t)}$ in the other dimensions (the $w^{(j)}$'s, $j \neq i$) However, $g^{(t)}$ updates $\Theta^{(t)}$ in all dimensions simultaneously in the same iteration $C(\Theta^{(t+1)})$ will be guaranteed to decrease only if C is linear at $\Theta^{(t)}$ Wrong assumption: $\Theta_i^{(t+1)}$ will decrease C even if the other $\Theta_j^{(t+1)}$'s are updated simultaneously Second-order methods? Time consuming Does not take into account high-order effects Can we change the model to make this assumption not-so-wrong? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 21 / 60

64 Batch Normalization I $\hat{y} = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$ Why not standardize each hidden activation $a^{(k)}$, $k = 1, \ldots, L-1$ (as we standardized x)? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 22 / 60

65 Batch Normalization I $\hat{y} = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$ Why not standardize each hidden activation $a^{(k)}$, $k = 1, \ldots, L-1$ (as we standardized x)? We have $\hat{y} = a^{(L-1)} w^{(L)}$ When $a^{(L-1)}$ is standardized, $g_L^{(t)} = \frac{\partial C}{\partial w^{(L)}}(\Theta^{(t)})$ is more likely to decrease C Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 22 / 60

66 Batch Normalization I $\hat{y} = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$ Why not standardize each hidden activation $a^{(k)}$, $k = 1, \ldots, L-1$ (as we standardized x)? We have $\hat{y} = a^{(L-1)} w^{(L)}$ When $a^{(L-1)}$ is standardized, $g_L^{(t)} = \frac{\partial C}{\partial w^{(L)}}(\Theta^{(t)})$ is more likely to decrease C If $x \sim \mathcal{N}(0,1)$, then still $a^{(L-1)} \sim \mathcal{N}(0,1)$, no matter how $w^{(1)}, \ldots, w^{(L-1)}$ change Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 22 / 60

67 Batch Normalization I $\hat{y} = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$ Why not standardize each hidden activation $a^{(k)}$, $k = 1, \ldots, L-1$ (as we standardized x)? We have $\hat{y} = a^{(L-1)} w^{(L)}$ When $a^{(L-1)}$ is standardized, $g_L^{(t)} = \frac{\partial C}{\partial w^{(L)}}(\Theta^{(t)})$ is more likely to decrease C If $x \sim \mathcal{N}(0,1)$, then still $a^{(L-1)} \sim \mathcal{N}(0,1)$, no matter how $w^{(1)}, \ldots, w^{(L-1)}$ change Changes in other dimensions proposed by the $g_i^{(t)}$'s, $i \neq L$, can be zeroed out Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 22 / 60

68 Batch Normalization I $\hat{y} = f(x) = x\, w^{(1)} w^{(2)} \cdots w^{(L)}$ Why not standardize each hidden activation $a^{(k)}$, $k = 1, \ldots, L-1$ (as we standardized x)? We have $\hat{y} = a^{(L-1)} w^{(L)}$ When $a^{(L-1)}$ is standardized, $g_L^{(t)} = \frac{\partial C}{\partial w^{(L)}}(\Theta^{(t)})$ is more likely to decrease C If $x \sim \mathcal{N}(0,1)$, then still $a^{(L-1)} \sim \mathcal{N}(0,1)$, no matter how $w^{(1)}, \ldots, w^{(L-1)}$ change Changes in other dimensions proposed by the $g_i^{(t)}$'s, $i \neq L$, can be zeroed out Similarly, if $a^{(k-1)}$ is standardized, $g_k^{(t)} = \frac{\partial C}{\partial w^{(k)}}(\Theta^{(t)})$ is more likely to decrease C Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 22 / 60

69 Batch Normalization II How to standardize $a^{(k)}$ at training and test time? We can standardize the input x because we see multiple examples Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 23 / 60

70 Batch Normalization II How to standardize $a^{(k)}$ at training and test time? We can standardize the input x because we see multiple examples During training time, we see a minibatch of activations $a^{(k)} \in \mathbb{R}^M$ (M the batch size) Batch normalization [6]: $\tilde{a}_i^{(k)} = \frac{a_i^{(k)} - \mu^{(k)}}{\sigma^{(k)}}, \forall i$ $\mu^{(k)}$ and $\sigma^{(k)}$ are the mean and std of activations across examples in the minibatch Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 23 / 60

71 Batch Normalization II How to standardize $a^{(k)}$ at training and test time? We can standardize the input x because we see multiple examples During training time, we see a minibatch of activations $a^{(k)} \in \mathbb{R}^M$ (M the batch size) Batch normalization [6]: $\tilde{a}_i^{(k)} = \frac{a_i^{(k)} - \mu^{(k)}}{\sigma^{(k)}}, \forall i$ $\mu^{(k)}$ and $\sigma^{(k)}$ are the mean and std of activations across examples in the minibatch At test time, $\mu^{(k)}$ and $\sigma^{(k)}$ can be replaced by running averages that were collected during training time Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 23 / 60

72 Batch Normalization II How to standardize $a^{(k)}$ at training and test time? We can standardize the input x because we see multiple examples During training time, we see a minibatch of activations $a^{(k)} \in \mathbb{R}^M$ (M the batch size) Batch normalization [6]: $\tilde{a}_i^{(k)} = \frac{a_i^{(k)} - \mu^{(k)}}{\sigma^{(k)}}, \forall i$ $\mu^{(k)}$ and $\sigma^{(k)}$ are the mean and std of activations across examples in the minibatch At test time, $\mu^{(k)}$ and $\sigma^{(k)}$ can be replaced by running averages that were collected during training time Can be readily extended to NNs having multiple neurons at each layer Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 23 / 60
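
A minimal NumPy sketch of batch normalization for one layer's activations, assuming the minibatch is arranged as an M-by-units array; the running-average update with a momentum constant is one common way to collect the test-time statistics mentioned above, not the only one.

```python
import numpy as np

def batchnorm_train(a, run_mu, run_var, momentum=0.9, eps=1e-5):
    """Normalize a minibatch of activations with batch statistics and
    update running averages for use at test time."""
    mu, var = a.mean(axis=0), a.var(axis=0)
    a_hat = (a - mu) / np.sqrt(var + eps)
    run_mu = momentum * run_mu + (1 - momentum) * mu
    run_var = momentum * run_var + (1 - momentum) * var
    return a_hat, run_mu, run_var

def batchnorm_test(a, run_mu, run_var, eps=1e-5):
    """At test time, replace the batch statistics with the running averages."""
    return (a - run_mu) / np.sqrt(run_var + eps)
```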

73 Standardizing Nonlinear Units How to standardize a nonlinear unit a (k) = act(z (k) )? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 24 / 60

74 Standardizing Nonlinear Units How to standardize a nonlinear unit a (k) = act(z (k) )? We can still zero out the effects from other layers by normalizing z (k) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 24 / 60

75 Standardizing Nonlinear Units How to standardize a nonlinear unit $a^{(k)} = \mathrm{act}(z^{(k)})$? We can still zero out the effects from other layers by normalizing $z^{(k)}$ Given a minibatch of $z^{(k)} \in \mathbb{R}^M$: $\tilde{z}_i^{(k)} = \frac{z_i^{(k)} - \mu^{(k)}}{\sigma^{(k)}}, \forall i$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 24 / 60

76 Standardizing Nonlinear Units How to standardize a nonlinear unit $a^{(k)} = \mathrm{act}(z^{(k)})$? We can still zero out the effects from other layers by normalizing $z^{(k)}$ Given a minibatch of $z^{(k)} \in \mathbb{R}^M$: $\tilde{z}_i^{(k)} = \frac{z_i^{(k)} - \mu^{(k)}}{\sigma^{(k)}}, \forall i$ A hidden unit now looks like: Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 24 / 60

77 Expressiveness I The weights $W^{(k)}$ at each layer are easier to train now The wrong assumption of gradient-based optimization is made valid Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 25 / 60

78 Expressiveness I The weights $W^{(k)}$ at each layer are easier to train now The wrong assumption of gradient-based optimization is made valid But at the cost of expressiveness Normalizing $a^{(k)}$ or $z^{(k)}$ limits the output range of a unit Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 25 / 60

79 Expressiveness I The weights $W^{(k)}$ at each layer are easier to train now The wrong assumption of gradient-based optimization is made valid But at the cost of expressiveness Normalizing $a^{(k)}$ or $z^{(k)}$ limits the output range of a unit Observe that there is no need to insist that a $z^{(k)}$ have zero mean and unit variance We only care about whether it is fixed when calculating the gradients for other layers Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 25 / 60

80 Expressiveness II During training time, we can introduce two parameters $\gamma$ and $\beta$ and back-propagate through $\gamma \tilde{z}^{(k)} + \beta$ to learn their best values Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 26 / 60

81 Expressiveness II During training time, we can introduce two parameters $\gamma$ and $\beta$ and back-propagate through $\gamma \tilde{z}^{(k)} + \beta$ to learn their best values Question: $\gamma$ and $\beta$ can be learned to invert $\tilde{z}^{(k)}$ to get back $z^{(k)}$, so what's the point? ($\tilde{z}^{(k)} = \frac{z^{(k)} - \mu^{(k)}}{\sigma^{(k)}}$, so $\gamma \tilde{z}^{(k)} + \beta = \sigma^{(k)} \tilde{z}^{(k)} + \mu^{(k)} = z^{(k)}$ when $\gamma = \sigma^{(k)}$ and $\beta = \mu^{(k)}$) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 26 / 60

82 Expressiveness II During training time, we can introduce two parameters $\gamma$ and $\beta$ and back-propagate through $\gamma \tilde{z}^{(k)} + \beta$ to learn their best values Question: $\gamma$ and $\beta$ can be learned to invert $\tilde{z}^{(k)}$ to get back $z^{(k)}$, so what's the point? ($\tilde{z}^{(k)} = \frac{z^{(k)} - \mu^{(k)}}{\sigma^{(k)}}$, so $\gamma \tilde{z}^{(k)} + \beta = \sigma^{(k)} \tilde{z}^{(k)} + \mu^{(k)} = z^{(k)}$ when $\gamma = \sigma^{(k)}$ and $\beta = \mu^{(k)}$) The weights $W^{(k)}$, $\gamma$, and $\beta$ are now easier to learn with SGD Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 26 / 60
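
A sketch of the learnable scale-and-shift applied to the normalized z̃, together with the gradients needed to train γ and β by back-propagation; d_y denotes the upstream gradient of the cost with respect to the layer output and is an assumed input.

```python
import numpy as np

def bn_scale_shift_forward(z_hat, gamma, beta):
    """y = gamma * z_hat + beta, applied elementwise per unit."""
    return gamma * z_hat + beta

def bn_scale_shift_backward(z_hat, gamma, d_y):
    """Gradients for gamma, beta, and z_hat given the upstream gradient d_y
    (minibatch along axis 0)."""
    d_gamma = (d_y * z_hat).sum(axis=0)
    d_beta = d_y.sum(axis=0)
    d_z_hat = d_y * gamma
    return d_gamma, d_beta, d_z_hat
```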

83 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 27 / 60

84 Parameter Initialization Initialization is important Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 28 / 60

85 Parameter Initialization Initialization is important How to better initialize $\Theta^{(0)}$? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 28 / 60

86 Parameter Initialization Initialization is important How to better initialize $\Theta^{(0)}$? 1 Train an NN multiple times with random initial points, and then pick the best Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 28 / 60

87 Parameter Initialization Initialization is important How to better initialize $\Theta^{(0)}$? 1 Train an NN multiple times with random initial points, and then pick the best 2 Design a series of cost functions such that a solution to one is a good initial point of the next Solve the easy problem first, and then a harder one, and so on Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 28 / 60

88 Continuation Methods I Continuation methods: construct easier cost functions by smoothing the original cost function: $\tilde{C}(\Theta) = \mathbb{E}_{\tilde{\Theta} \sim \mathcal{N}(\Theta, \sigma^2)}\, C(\tilde{\Theta})$ In practice, we sample several $\tilde{\Theta}$'s to approximate the expectation Assumption: some non-convex functions become approximately convex when smoothed Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 29 / 60
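
A sketch of the sampling approximation mentioned above: the gradient of the smoothed cost is approximated by averaging gradients at a few perturbed parameter vectors. cost_grad_fn is an assumed callable returning the gradient of C at a given Θ; n_samples and sigma are illustrative.

```python
import numpy as np

def smoothed_cost_grad(theta, cost_grad_fn, sigma, n_samples=10, rng=None):
    """Monte-Carlo estimate of the gradient of the smoothed cost
    C~(theta) = E over theta~ ~ N(theta, sigma^2 I) of C(theta~)."""
    rng = rng or np.random.default_rng(0)
    grads = [cost_grad_fn(theta + sigma * rng.standard_normal(theta.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)
```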

89 Continuation Methods II Problems? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 30 / 60

90 Continuation Methods II Problems? Cost function might not become convex, no matter how much it is smoothed Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 30 / 60

91 Continuation Methods II Problems? Cost function might not become convex, no matter how much it is smoothed Designed to deal with local minima; not very helpful for NNs without minima Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 30 / 60

92 Curriculum Learning Curriculum learning (or shaping) [1]: make the cost function easier by increasing the influence of simpler examples E.g., by assigning them larger weights in the new cost function Or, by sampling them more frequently Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 31 / 60

93 Curriculum Learning Curriculum learning (or shaping) [1]: make the cost function easier by increasing the influence of simpler examples E.g., by assigning them larger weights in the new cost function Or, by sampling them more frequently How to define simple examples? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 31 / 60

94 Curriculum Learning Curriculum learning (or shaping) [1]: make the cost function easier by increasing the influence of simpler examples E.g., by assigning them larger weights in the new cost function Or, by sampling them more frequently How to define simple examples? Face image recognition: front view (easy) vs. side view (hard) Sentiment analysis for movie reviews: 0-/5-star reviews (easy) vs. 1-/2-/3-/4-star reviews (hard) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 31 / 60

95 Curriculum Learning Curriculum learning (or shaping) [1]: make the cost function easier by increasing the influence of simpler examples E.g., by assigning them larger weights in the new cost function Or, by sampling them more frequently How to define simple examples? Face image recognition: front view (easy) vs. side view (hard) Sentiment analysis for movie reviews: 0-/5-star reviews (easy) vs. 1-/2-/3-/4-star reviews (hard) Learn simple concepts first, then learn more complex concepts that depend on these simpler concepts Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 31 / 60

96 Curriculum Learning Curriculum learning (or shaping) [1]: make the cost function easier by increasing the influence of simpler examples E.g., by assigning them larger weights in the new cost function Or, by sampling them more frequently How to define simple examples? Face image recognition: front view (easy) vs. side view (hard) Sentiment analysis for movie reviews: 0-/5-star reviews (easy) vs. 1-/2-/3-/4-star reviews (hard) Learn simple concepts first, then learn more complex concepts that depend on these simpler concepts Just like how humans learn Knowing the principles, we are less likely to explain an observation using special (but wrong) rules Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 31 / 60
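
One possible way to realize "sample simpler examples more frequently", sketched below under the assumption that each example already has a difficulty score in [0, 1]; the exponential weighting schedule and its constant 5.0 are arbitrary illustrative choices, not part of the cited curriculum-learning work.

```python
import numpy as np

def curriculum_sample(difficulty, epoch, total_epochs, batch_size, rng=None):
    """Sample a minibatch of example indices, favoring easy examples early on.
    difficulty[i] in [0, 1]: 0 = easy (e.g., front view), 1 = hard (side view)."""
    rng = rng or np.random.default_rng(0)
    progress = epoch / total_epochs            # 0: mostly easy, 1: near uniform
    weights = np.exp(-(1.0 - progress) * 5.0 * difficulty)
    weights /= weights.sum()
    return rng.choice(len(difficulty), size=batch_size, p=weights)
```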

97 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 32 / 60

98 Regularization The goal of an ML algorithm is to perform well not just on the training data, but also on new inputs Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 33 / 60

99 Regularization The goal of an ML algorithm is to perform well not just on the training data, but also on new inputs Regularization: techniques that reduce the generalization error of an ML algorithm But not the training error Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 33 / 60

100 Regularization The goal of an ML algorithm is to perform well not just on the training data, but also on new inputs Regularization: techniques that reduce the generalization error of an ML algorithm But not the training error By expressing preference to a simpler model Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 33 / 60

101 Regularization The goal of an ML algorithm is to perform well not just on the training data, but also on new inputs Regularization: techniques that reduce the generalization error of an ML algorithm But not the training error By expressing preference to a simpler model By providing different perspectives on how to explain the training data Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 33 / 60

102 Regularization The goal of an ML algorithm is to perform well not just on the training data, but also on new inputs Regularization: techniques that reduce the generalization error of an ML algorithm But not the training error By expressing preference to a simpler model By providing different perspectives on how to explain the training data By encoding prior knowledge Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 33 / 60

103 Regularization in Deep Learning I I have big data; do I still need to regularize my NN? The excess error is dominated by optimization error (time) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 34 / 60

104 Regularization in Deep Learning I I have big data; do I still need to regularize my NN? The excess error is dominated by optimization error (time) Generally, yes! Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 34 / 60

105 Regularization in Deep Learning I I have big data; do I still need to regularize my NN? The excess error is dominated by optimization error (time) Generally, yes! For hard problems, the true data generating process is almost certainly outside the model family E.g., problems in images, audio sequences, and text domains Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 34 / 60

106 Regularization in Deep Learning I I have big data; do I still need to regularize my NN? The excess error is dominated by optimization error (time) Generally, yes! For hard problems, the true data generating process is almost certainly outside the model family E.g., problems in images, audio sequences, and text domains The true generation process essentially involves simulating the entire universe Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 34 / 60

107 Regularization in Deep Learning I I have big data; do I still need to regularize my NN? The excess error is dominated by optimization error (time) Generally, yes! For hard problems, the true data generating process is almost certainly outside the model family E.g., problems in images, audio sequences, and text domains The true generation process essentially involves simulating the entire universe In these domains, the best fitting model (with lowest generalization error) is usually a larger model regularized appropriately Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 34 / 60

108 Regularization in Deep Learning II For easy problems, regularization may be necessary to make the problems well defined Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 35 / 60

109 Regularization in Deep Learning II For easy problems, regularization may be necessary to make the problems well defined For example, when applying a logistic regression to a linearly separable dataset: $\arg\max_w \log \prod_i P(y^{(i)} \mid x^{(i)}; w) = \arg\max_w \log \prod_i \sigma(w^\top x^{(i)})^{y^{(i)}} [1 - \sigma(w^\top x^{(i)})]^{(1 - y^{(i)})}$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 35 / 60

110 Regularization in Deep Learning II For easy problems, regularization may be necessary to make the problems well defined For example, when applying a logistic regression to a linearly separable dataset: $\arg\max_w \log \prod_i P(y^{(i)} \mid x^{(i)}; w) = \arg\max_w \log \prod_i \sigma(w^\top x^{(i)})^{y^{(i)}} [1 - \sigma(w^\top x^{(i)})]^{(1 - y^{(i)})}$ If a weight vector w is able to achieve perfect classification, so is 2w Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 35 / 60

111 Regularization in Deep Learning II For easy problems, regularization may be necessary to make the problems well defined For example, when applying a logistic regression to a linearly separable dataset: $\arg\max_w \log \prod_i P(y^{(i)} \mid x^{(i)}; w) = \arg\max_w \log \prod_i \sigma(w^\top x^{(i)})^{y^{(i)}} [1 - \sigma(w^\top x^{(i)})]^{(1 - y^{(i)})}$ If a weight vector w is able to achieve perfect classification, so is 2w Furthermore, 2w gives higher likelihood Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 35 / 60

112 Regularization in Deep Learning II For easy problems, regularization may be necessary to make the problems well defined For example, when applying a logistic regression to a linearly separable dataset: $\arg\max_w \log \prod_i P(y^{(i)} \mid x^{(i)}; w) = \arg\max_w \log \prod_i \sigma(w^\top x^{(i)})^{y^{(i)}} [1 - \sigma(w^\top x^{(i)})]^{(1 - y^{(i)})}$ If a weight vector w is able to achieve perfect classification, so is 2w Furthermore, 2w gives higher likelihood Without regularization, SGD will continually increase w's magnitude Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 35 / 60

113 Regularization in Deep Learning II For easy problems, regularization may be necessary to make the problems well defined For example, when applying a logistic regression to a linearly separable dataset: $\arg\max_w \log \prod_i P(y^{(i)} \mid x^{(i)}; w) = \arg\max_w \log \prod_i \sigma(w^\top x^{(i)})^{y^{(i)}} [1 - \sigma(w^\top x^{(i)})]^{(1 - y^{(i)})}$ If a weight vector w is able to achieve perfect classification, so is 2w Furthermore, 2w gives higher likelihood Without regularization, SGD will continually increase w's magnitude A deep NN is likely able to separate a dataset and has a similar issue Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 35 / 60

114 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 36 / 60

115 Weight Decay To add norm penalties: $\arg\min_\Theta C(\Theta) + \alpha \Omega(\Theta)$ $\Omega$ can be, e.g., the $L^1$- or $L^2$-norm Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 37 / 60

116 Weight Decay To add norm penalties: $\arg\min_\Theta C(\Theta) + \alpha \Omega(\Theta)$ $\Omega$ can be, e.g., the $L^1$- or $L^2$-norm $\Omega(W)$, $\Omega(W^{(k)})$, $\Omega(W^{(k)}_{i,:})$, or $\Omega(W^{(k)}_{:,j})$? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 37 / 60

117 Weight Decay To add norm penalties: $\arg\min_\Theta C(\Theta) + \alpha \Omega(\Theta)$ $\Omega$ can be, e.g., the $L^1$- or $L^2$-norm $\Omega(W)$, $\Omega(W^{(k)})$, $\Omega(W^{(k)}_{i,:})$, or $\Omega(W^{(k)}_{:,j})$? Limiting the column norms $\Omega(W^{(k)}_{:,j})$, $\forall j,k$, is preferred [5] Prevents any one hidden unit from having very large weights and $z^{(k)}_j$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 37 / 60

118 Explicit Weight Decay I Explicit norm penalties: $\arg\min_\Theta C(\Theta)$ subject to $\Omega(\Theta) \le R$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 38 / 60

119 Explicit Weight Decay I Explicit norm penalties: $\arg\min_\Theta C(\Theta)$ subject to $\Omega(\Theta) \le R$ To solve the problem, we can use projected SGD: At each step t, update $\Theta^{(t+1)}$ as in SGD If $\Theta^{(t+1)}$ falls out of the feasible set, project $\Theta^{(t+1)}$ back to the tangent space (edge) of the feasible set Advantage? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 38 / 60

120 Explicit Weight Decay I Explicit norm penalties: $\arg\min_\Theta C(\Theta)$ subject to $\Omega(\Theta) \le R$ To solve the problem, we can use projected SGD: At each step t, update $\Theta^{(t+1)}$ as in SGD If $\Theta^{(t+1)}$ falls out of the feasible set, project $\Theta^{(t+1)}$ back to the tangent space (edge) of the feasible set Advantage? Prevents dead units that do not contribute much to the behavior of the NN due to too-small weights Explicit constraints do not push weights to the origin Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 38 / 60

121 Explicit Weight Decay II Also prevents instability due to a large learning rate Reprojection clips the weights and improves numeric stability Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 39 / 60

122 Explicit Weight Decay II Also prevents instability due to a large learning rate Reprojection clips the weights and improves numeric stability Hinton et al. [5] recommend using: explicit constraints + reprojection + large learning rate to allow rapid exploration of parameter space while maintaining numeric stability Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 39 / 60
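
A sketch of the reprojection step for the column-norm constraint Ω(W[:, j]) ≤ R, applied after a plain SGD update; the value of R and the tiny constant added to the norms are illustrative.

```python
import numpy as np

def project_column_norms(W, R):
    """Rescale each column of W whose L2 norm exceeds R back onto the
    constraint boundary (columns already inside the feasible set are untouched)."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, R / (norms + 1e-12))
    return W * scale

# Inside a training loop (sketch):
# W = W - eta * grad_W                  # SGD step, possibly with a large eta
# W = project_column_norms(W, R=3.0)    # reproject into the feasible set
```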

123 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 40 / 60

124 Data Augmentation Theoretically, the best way to improve the generalizability of a model is to train it on more data Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 41 / 60

125 Data Augmentation Theoretically, the best way to improve the generalizability of a model is to train it on more data For some ML tasks, it is not hard to create new fake data In classification, we can generate new (x, y) pairs by transforming an example input $x^{(i)}$ given the same $y^{(i)}$ E.g., scaling, translating, rotating, or flipping images ($x^{(i)}$'s) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 41 / 60

126 Data Augmentation Theoretically, the best way to improve the generalizability of a model is to train it on more data For some ML tasks, it is not hard to create new fake data In classification, we can generate new (x, y) pairs by transforming an example input $x^{(i)}$ given the same $y^{(i)}$ E.g., scaling, translating, rotating, or flipping images ($x^{(i)}$'s) Very effective in image object recognition and speech recognition tasks Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 41 / 60

127 Data Augmentation Caution Theoretically, the best way to improve the generalizability of a model is to train it on more data For some ML tasks, it is not hard to create new fake data In classification, we can generate new (x, y) pairs by transforming an example input $x^{(i)}$ given the same $y^{(i)}$ E.g., scaling, translating, rotating, or flipping images ($x^{(i)}$'s) Very effective in image object recognition and speech recognition tasks Do not apply transformations that would change the correct class! Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 41 / 60

128 Data Augmentation Caution Theoretically, the best way to improve the generalizability of a model is to train it on more data For some ML tasks, it is not hard to create new fake data In classification, we can generate new (x, y) pairs by transforming an example input $x^{(i)}$ given the same $y^{(i)}$ E.g., scaling, translating, rotating, or flipping images ($x^{(i)}$'s) Very effective in image object recognition and speech recognition tasks Do not apply transformations that would change the correct class! E.g., in OCR tasks, avoid: Horizontal flips for b and d 180° rotations for 6 and 9 Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 41 / 60
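
A small NumPy-only sketch of label-preserving transforms for a 2-D image array; the flip probability and shift range are illustrative, and the flip should be disabled for classes (such as 'b' vs. 'd') where it would change the label.

```python
import numpy as np

def augment(image, allow_flip=True, rng=None):
    """Return a randomly transformed copy of a 2-D image array."""
    rng = rng or np.random.default_rng()
    out = image
    if allow_flip and rng.random() < 0.5:
        out = np.fliplr(out)                      # horizontal flip
    dy, dx = rng.integers(-2, 3, size=2)          # small random translation
    out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))
    return out
```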

129 Noise and Adversarial Data NNs are not very robust to the perturbation of input (x (i) s) Noises [11] Adversarial points [3] Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 42 / 60

130 Noise and Adversarial Data NNs are not very robust to the perturbation of input (x (i) s) Noises [11] Adversarial points [3] How to improve the robustness? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 42 / 60

131 Noise Injection We can train an NN with artificial random noise applied to x (i) s Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 43 / 60

132 Noise Injection We can train an NN with artificial random noise applied to the $x^{(i)}$'s Why does noise injection work? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 43 / 60

133 Noise Injection We can train an NN with artificial random noise applied to the $x^{(i)}$'s Why does noise injection work? Recall that the analytic solution of Ridge regression is $w = (X^\top X + \alpha I)^{-1} X^\top y$ In this case, weight decay = adding variance (noise) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 43 / 60

134 Noise Injection We can train an NN with artificial random noise applied to the $x^{(i)}$'s Why does noise injection work? Recall that the analytic solution of Ridge regression is $w = (X^\top X + \alpha I)^{-1} X^\top y$ In this case, weight decay = adding variance (noise) More generally, makes the function f locally constant Cost function C insensitive to small variations in weights Finds solutions that are not merely minima, but minima surrounded by flat regions Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 43 / 60

135 Noise Injection We can train an NN with artificial random noise applied to the $x^{(i)}$'s Why does noise injection work? Recall that the analytic solution of Ridge regression is $w = (X^\top X + \alpha I)^{-1} X^\top y$ In this case, weight decay = adding variance (noise) More generally, makes the function f locally constant Cost function C insensitive to small variations in weights Finds solutions that are not merely minima, but minima surrounded by flat regions Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 43 / 60
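
A minimal sketch of input noise injection during training; sigma controls the noise magnitude and, as noted for hidden-layer noise on the next slides, usually needs tuning.

```python
import numpy as np

def add_input_noise(X, sigma=0.1, rng=None):
    """Inject zero-mean Gaussian noise into a minibatch of inputs."""
    rng = rng or np.random.default_rng()
    return X + sigma * rng.standard_normal(X.shape)
```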

136 Variants We can also inject noise to hidden representations [8] Highly effective provided that the magnitude of the noise can be carefully tuned Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 44 / 60

137 Variants We can also inject noise to hidden representations [8] Highly effective provided that the magnitude of the noise can be carefully tuned The batch normalization, in addition to simplifying optimization, offers similar regularization effect to noise injection Injects noises from examples in a minibatch to an activation a (k) j Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 44 / 60

138 Variants We can also inject noise to hidden representations [8] Highly effective provided that the magnitude of the noise can be carefully tuned The batch normalization, in addition to simplifying optimization, offers similar regularization effect to noise injection Injects noises from examples in a minibatch to an activation a (k) j How about injecting noise to outputs (y (i) s)? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 44 / 60

139 Variants We can also inject noise to hidden representations [8] Highly effective provided that the magnitude of the noise can be carefully tuned The batch normalization, in addition to simplifying optimization, offers similar regularization effect to noise injection Injects noises from examples in a minibatch to an activation a (k) j How about injecting noise to outputs (y (i) s)? Already done in probabilistic models Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 44 / 60

140 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 45 / 60

141 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

142 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Voting: reduces variance of predictions if having independent voters Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

143 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Voting: reduces variance of predictions if having independent voters Bagging: resample X to make voters less dependent Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

144 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Voting: reduces variance of predictions if having independent voters Bagging: resample X to make voters less dependent Boosting: increase confidence (margin) of predictions, if not overfitting Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

145 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Voting: reduces variance of predictions if having independent voters Bagging: resample X to make voters less dependent Boosting: increase confidence (margin) of predictions, if not overfitting Ensemble methods in deep learning? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

146 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Voting: reduces variance of predictions if having independent voters Bagging: resample X to make voters less dependent Boosting: increase confidence (margin) of predictions, if not overfitting Ensemble methods in deep learning? Voting: train multiple NNs Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

147 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Voting: reduces variance of predictions if having independent voters Bagging: resample X to make voters less dependent Boosting: increase confidence (margin) of predictions, if not overfitting Ensemble methods in deep learning? Voting: train multiple NNs Bagging: train multiple NNs, each with resampled X Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

148 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Voting: reduces variance of predictions if having independent voters Bagging: resample X to make voters less dependent Boosting: increase confidence (margin) of predictions, if not overfitting Ensemble methods in deep learning? Voting: train multiple NNs Bagging: train multiple NNs, each with resampled X GoogLeNet [10], winner of ILSVRC'14, is an ensemble of 6 NNs Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

149 Ensemble Methods Ensemble methods can improve generalizability by offering different explanations to X Voting: reduces variance of predictions if having independent voters Bagging: resample X to make voters less dependent Boosting: increase confidence (margin) of predictions, if not overfitting Ensemble methods in deep learning? Voting: train multiple NNs Bagging: train multiple NNs, each with resampled X GoogLeNet [10], winner of ILSVRC'14, is an ensemble of 6 NNs Very time consuming to ensemble a large number of NNs Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 46 / 60

150 Dropout I Dropout: a feature-based bagging Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 47 / 60

151 Dropout I Dropout: a feature-based bagging Resamples input as well as latent features Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 47 / 60

152 Dropout I Dropout: a feature-based bagging Resamples input as well as latent features With parameter sharing among voters Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 47 / 60

153 Dropout I Dropout: a feature-based bagging Resamples input as well as latent features With parameter sharing among voters SGD training: each time loading a minibatch, randomly sample a binary mask to apply to all input and hidden units Each unit has probability $\alpha$ to be included (a hyperparameter) Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 47 / 60

154 Dropout I Dropout: a feature-based bagging Resamples input as well as latent features With parameter sharing among voters SGD training: each time loading a minibatch, randomly sample a binary mask to apply to all input and hidden units Each unit has probability $\alpha$ to be included (a hyperparameter) Typically, 0.8 for input units and 0.5 for hidden units Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 47 / 60

155 Dropout I Dropout: a feature-based bagging Resamples input as well as latent features With parameter sharing among voters SGD training: each time loading a minibatch, randomly sample a binary mask to apply to all input and hidden units Each unit has probability $\alpha$ to be included (a hyperparameter) Typically, 0.8 for input units and 0.5 for hidden units Different minibatches are used to train different parts of the NN Similar to bagging, but much more efficient Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 47 / 60

156 Dropout I Dropout: a feature-based bagging Resamples input as well as latent features With parameter sharing among voters SGD training: each time loading a minibatch, randomly sample a binary mask to apply to all input and hidden units Each unit has probability $\alpha$ to be included (a hyperparameter) Typically, 0.8 for input units and 0.5 for hidden units Different minibatches are used to train different parts of the NN Similar to bagging, but much more efficient No need to retrain unmasked units Exponential number of voters Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 47 / 60

157 Dropout II How to vote to make a final prediction? Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 48 / 60

158 Dropout II How to vote to make a final prediction? Mask sampling: 1 Randomly sample some (typically, 10 to 20) masks 2 For each mask, apply it to the trained NN and get a prediction 3 Average the predictions Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 48 / 60

159 Dropout II How to vote to make a final prediction? Mask sampling: 1 Randomly sample some (typically, 10 to 20) masks 2 For each mask, apply it to the trained NN and get a prediction 3 Average the predictions Weight scaling: Make a single prediction using the NN with all units But the weights going out from a unit are multiplied by $\alpha$ Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 48 / 60

160 Dropout II How to vote to make a final prediction? Mask sampling: 1 Randomly sample some (typically, 10 to 20) masks 2 For each mask, apply it to the trained NN and get a prediction 3 Average the predictions Weight scaling: Make a single prediction using the NN with all units But the weights going out from a unit are multiplied by $\alpha$ Heuristic: each unit outputs the same expected amount of weight as in training Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 48 / 60

161 Dropout II How to vote to make a final prediction? Mask sampling: 1 Randomly sample some (typically, 10 to 20) masks 2 For each mask, apply it to the trained NN and get a prediction 3 Average the predictions Weight scaling: Make a single prediction using the NN with all units But the weights going out from a unit are multiplied by $\alpha$ Heuristic: each unit outputs the same expected amount of weight as in training Which is better is problem-dependent Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 48 / 60
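
A sketch of dropout at training time and the weight-scaling rule at test time. Scaling a unit's activation by its keep probability α is equivalent to scaling its outgoing weights, which is how the slide phrases the rule; keep_prob plays the role of α.

```python
import numpy as np

def dropout_train(h, keep_prob, rng=None):
    """Training: sample a binary mask; each unit is kept with probability keep_prob."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) < keep_prob).astype(h.dtype)
    return h * mask

def dropout_test(h, keep_prob):
    """Test (weight scaling): keep every unit but scale its output by keep_prob."""
    return h * keep_prob
```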

162 Dropout III Dropout improves generalization beyond ensembling Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 49 / 60

163 Dropout III Dropout improves generalization beyond ensembling For example, in face image recognition: Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 49 / 60

164 Dropout III Dropout improves generalization beyond ensembling For example, in face image recognition: If there is a unit that detects nose Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 49 / 60

165 Dropout III Dropout improves generalization beyond ensembling For example, in face image recognition: If there is a unit that detects nose Dropping the unit encourages the model to learn mouth (or nose again) in another unit Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 49 / 60

166 Outline 1 Optimization Momentum & Nesterov Momentum AdaGrad & RMSProp Batch Normalization Continuation Methods & Curriculum Learning 2 Regularization Weight Decay Data Augmentation Dropout Manifold Regularization Domain-Specific Model Design Shan-Hung Wu (CS, NTHU) NN Opt & Reg Machine Learning 50 / 60

Manifolds I

One way to improve the generalizability of a model is to incorporate prior knowledge
In many applications, data of the same class concentrate around one or more low-dimensional manifolds
A manifold is a topological space that is locally linear

Manifolds II

For each point x on a manifold, we have its tangent space spanned by tangent vectors
These local directions specify how one can change x infinitesimally while staying on the manifold
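For image data, a tangent direction can be approximated numerically by perturbing an example with a transformation that keeps it on the manifold, such as a tiny rotation. The sketch below only illustrates this idea; the image, step size, and function name are placeholders:

```python
import numpy as np
from scipy.ndimage import rotate

def rotation_tangent(image, eps=1.0):
    """Central-difference approximation of the tangent vector at `image`
    along the rotation direction: a small rotation keeps the image (nearly)
    on the manifold of valid images of the same object."""
    plus = rotate(image, angle=+eps, reshape=False)
    minus = rotate(image, angle=-eps, reshape=False)
    return (plus - minus) / (2.0 * eps)

img = np.random.default_rng(2).random((28, 28))  # hypothetical grayscale image
v_rot = rotation_tangent(img)                    # same shape as img
```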

Tangent Prop

How can we incorporate the manifold prior into a model?

Suppose we have the tangent vectors {v^(i,j)}_j for each example x^(i)
Tangent Prop [9] trains an NN classifier f with the cost penalty

    $\Omega[f] = \sum_{i,j} \left( \nabla_{\mathbf{x}} f(\mathbf{x}^{(i)})^{\top} \mathbf{v}^{(i,j)} \right)^2$

which encourages f to be locally constant along the tangent directions

How do we obtain the {v^(i,j)}_j?
- Manually specified based on domain knowledge; for images: scaling, translation, rotation, flipping, etc.
- Or learned automatically (to be discussed later)
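Assuming the tangent vectors are available, the penalty can be approximated without framework-specific autodiff by replacing each directional derivative with a central finite difference. This is a rough sketch, not the original Tangent Prop implementation; the scalar classifier f, the step size, and the toy data are all placeholders:

```python
import numpy as np

def tangent_prop_penalty(f, xs, tangents, eps=1e-3):
    """Approximate Omega[f] = sum_{i,j} (grad_x f(x^(i))^T v^(i,j))^2
    with a central finite difference for each directional derivative.

    f        : callable mapping an input array to a scalar output
    xs       : iterable of example inputs x^(i)
    tangents : tangents[i] is the list of tangent vectors v^(i,j) at x^(i)
    """
    penalty = 0.0
    for x, vs in zip(xs, tangents):
        for v in vs:
            directional = (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)
            penalty += directional ** 2
    return penalty

# toy usage: a quadratic "classifier" and one example with one tangent vector
f = lambda x: float(np.sum(x ** 2))
x0 = np.ones(3)
v0 = np.array([0.0, 1.0, 0.0])
omega = tangent_prop_penalty(f, [x0], [[v0]])  # directional derivative = 2, penalty ~ 4
```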

Outline

1 Optimization
  Momentum & Nesterov Momentum
  AdaGrad & RMSProp
  Batch Normalization
  Continuation Methods & Curriculum Learning
2 Regularization
  Weight Decay
  Data Augmentation
  Dropout
  Manifold Regularization
  Domain-Specific Model Design

Domain-Specific Prior Knowledge

If done right, incorporating domain-specific prior knowledge into a model is a highly effective way to improve generalizability
- A better f that makes sense for the domain
- It may also simplify the optimization problem

Word2vec

Weight tying leads to a simpler model
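One common form of this weight tying is the CBOW architecture, where every context position shares a single embedding matrix rather than having its own weights. The sketch below only illustrates that sharing; the vocabulary size, dimensions, and random initialization are made up, and a real word2vec model would be trained with, e.g., negative sampling:

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 10000, 100                            # hypothetical vocabulary size, embedding dim
E = rng.normal(scale=0.01, size=(V, d))      # ONE embedding matrix shared by all context positions
W_out = rng.normal(scale=0.01, size=(d, V))  # output (scoring) weights

def cbow_scores(context_ids):
    """Embed every context word with the SAME matrix E (weight tying),
    average the embeddings, and score all vocabulary words."""
    h = E[context_ids].mean(axis=0)          # (d,)
    return h @ W_out                         # (V,) unnormalized scores

scores = cbow_scores(np.array([12, 7, 981, 45]))
```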

Convolutional Neural Networks

Locally connected neurons for pattern detection at different locations
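The core idea, the same small set of weights applied at every spatial location, can be illustrated with a plain 2D cross-correlation. This is a didactic sketch with a made-up kernel and image, not an efficient or complete CNN layer:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide ONE shared kernel over the image (local connectivity plus
    weight sharing), producing a feature map of pattern-detection responses."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # hypothetical vertical-edge detector
image = np.random.default_rng(4).random((8, 8))
feature_map = conv2d_valid(image, edge_kernel)   # shape (6, 6)
```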
