Adam: A Method for Stochastic Optimization


1 Adam: A Method for Stochastic Optimization Diederik P. Kingma, Jimmy Ba Presented by

2 Content
Background: Supervised ML theory and the importance of optimum finding; Gradient descent and its variants; Limitations of SGD
Earlier approaches/building blocks: Momentum; Nesterov accelerated gradient (NAG); AdaGrad (Adaptive Gradient); AdaDelta and RMSProp
Adam: Update rule; Bias correction; AdaMax
Post-Adam innovations: Improving Adam; Additional approaches; Shampoo

3 Content
Background: Supervised ML theory and the importance of optimum finding; Gradient descent and its variants; Limitations of SGD
Earlier approaches/building blocks: Momentum; Nesterov accelerated gradient (NAG); AdaGrad (Adaptive Gradient); AdaDelta and RMSProp
Adam: Update rule; Bias correction; AdaMax
Post-Adam innovations: Improving Adam; Additional approaches; Shampoo

4 4

5 5

6 Basic Supervised Machine Learning Terminology. Notation: X is the instance space; Y is the label space; x ~ D, where D is an unknown probability distribution over X; f : X → Y is the true mapping between instance and label spaces, unknown to the learner.

7 Examples of instance and label spaces (X / Y):
- X: the space of all RGB images of some dimension. Y: is there a cat in the image? ({0, 1})
- X: the space of a stock's historical price sequences. Y: the stock's next-day closing price ([0, ∞))
- X: the space of all finite Chinese sentences. Y: the set of all finite English sentences
- X: the space of all information regarding two companies. Y: the probability of a merger being successful
- X: the space of all MRI images. Y: the probability, location and type of a tumor
- X: the space of all finite-length voice sequences. Y: the corresponding Amazon product being referred to

8 Basic Supervised Machine Learning Terminology. Notation: X is the instance space; Y is the label space; x ~ D, where D is an unknown probability distribution over X; f : X → Y is the true mapping between instance and label spaces, unknown.

9 Basic Supervised Machine Learning Terminology. Notation: X is the instance space; Y is the label space; x ~ D, where D is an unknown probability distribution over X; f : X → Y is the true mapping between instance and label spaces, unknown; {(x_i, y_i)}_{i=1}^m with x_i ∈ X, y_i ∈ Y is the training set; h : X → Y, h ∈ H, is a hypothesis, the object we wish to learn.

10 Basic Supervised Machine Learning theory. The goal is to find a hypothesis h that approximates f as well as possible. But approximates in what sense? We need a way of evaluating the quality of a hypothesis.

11 Basic Supervised Machine Learning theory. Let us define a loss function ℓ : Y × Y → [0, ∞). For example: the zero-one loss 1[h(x_i) ≠ y_i], or the quadratic loss (h(x_i) − y_i)². Here we use y_i = f(x_i). The loss is a measure of how badly the hypothesis did on a single sample.
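
As a concrete reading of these definitions, here is a minimal Python sketch of the two example losses (the function names are mine, not from the slides):

```python
def zero_one_loss(prediction, label):
    # 1 if the hypothesis is wrong on this sample, 0 otherwise
    return float(prediction != label)

def quadratic_loss(prediction, label):
    # squared difference between the prediction and the true label
    return (prediction - label) ** 2
```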

12 Basic Supervised Machine Learning theory. Let us also define the Generalization error (aka Risk) of a hypothesis h: L_D(h) = E_{x~D}[ℓ(h(x), f(x))]. This is a measure of how badly the hypothesis does on the entire instance space. A good hypothesis is one with a low Risk value.

13 Basic Supervised Machine Learning theory. Let us also define the Generalization error (aka Risk) of a hypothesis h: L_D(h) = E_{x~D}[ℓ(h(x), f(x))]. This is a measure of how badly the hypothesis does on the entire instance space. A good hypothesis is one with a low Risk value. Unfortunately...

14 Basic Supervised Machine Learning theory. Unfortunately, the true Risk L_D(h) cannot be computed, because the distribution D is unknown to the learning algorithm.

15 Basic Supervised Machine Learning theory. Unfortunately, the true Risk L_D(h) cannot be computed, because the distribution D is unknown to the learning algorithm. We can, however, compute a proxy of the true Risk, called the Empirical Risk.

16 Basic Supervised Machine Learning theory. Let us define the Empirical Risk of a hypothesis h: L_emp(h) = (1/m) Σ_{i=1}^m ℓ(h(x_i), y_i). Recall: {(x_i, y_i)}_{i=1}^m, with x_i ∈ X, y_i ∈ Y, is the training set, while the true Risk is L_D(h) = E_{x~D}[ℓ(h(x), f(x))].
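
The Empirical Risk is just the average loss over the training set; a minimal sketch of that computation (the names and the toy example are mine):

```python
def empirical_risk(h, loss, training_set):
    """L_emp(h) = (1/m) * sum of loss(h(x_i), y_i) over the m training samples."""
    return sum(loss(h(x), y) for x, y in training_set) / len(training_set)

# Example: quadratic loss of the hypothesis h(x) = 2x on three samples.
data = [(0.0, 0.1), (1.0, 2.2), (2.0, 3.9)]
print(empirical_risk(lambda x: 2 * x, lambda p, y: (p - y) ** 2, data))
```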

17 Empirical Risk Minimization (ERM) strategy. After mountains of theory (PAC Learning, VC Theory, etc.), the following theorem is proven (stated here very informally): the best strategy for a learning algorithm is to minimize the Empirical Risk.

18 Empirical Risk Minimization (ERM) strategy. This means that the learning algorithm defined by the ERM principle boils down to solving the following optimization problem: find ĥ such that ĥ = argmin_{h∈H} L_emp(h) = argmin_{h∈H} (1/m) Σ_{i=1}^m ℓ(h(x_i), y_i).

19 The hypothesis class can be very simple. For example, when X ⊆ ℝ³ and Y = {−1, 1}, H can be the class of all three-dimensional hyperplanes: h_w(x) = w^T x, with w = (w_0, w_1, w_2, w_3) ∈ ℝ⁴.

20 The hypothesis class can be very complex. Recent years' Deep Learning architectures result in models with tens of millions of parameters, θ ∈ ℝ^d with d on the order of 10⁷.

21 The bottom line. Machine Learning applications require us to find good local minima of extremely high-dimensional, noisy, non-convex functions: ĥ = argmin_{h∈H} (1/m) Σ_{i=1}^m ℓ(h(x_i), y_i).

22 The bottom line. Machine Learning applications require us to find good local minima of extremely high-dimensional, noisy, non-convex functions. We need a principled method for achieving this goal: ĥ = argmin_{h∈H} (1/m) Σ_{i=1}^m ℓ(h(x_i), y_i).

23 Introducing The Gradient method 25

24 Questions?

25 Content
Background: Supervised ML theory and the importance of optimum finding; Gradient descent and its variants; Limitations of SGD
Earlier approaches/building blocks: Momentum; Nesterov accelerated gradient (NAG); AdaGrad (Adaptive Gradient); AdaDelta and RMSProp
Adam: Update rule; Bias correction; AdaMax
Post-Adam innovations: Improving Adam; Additional approaches; Shampoo

26 A couple of notes before we head on. I stick to the papers' notations. From here on we'll use J(θ), which is the same as L_emp(h).

27 Introducing The Gradient method. Input: learning rate η, tolerance parameter ε > 0. Initialization: pick θ_0 ∈ ℝ^d arbitrarily. General step: set θ_{t+1} = θ_t − η ∇J(θ_t); if ‖∇J(θ_{t+1})‖ ≤ ε, then STOP and θ_{t+1} is the output.
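
A minimal Python sketch of this general step (the function signature, the iteration cap max_iter, and the toy objective are mine, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, theta0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Iterate theta <- theta - eta * grad_J(theta) until the gradient norm drops to eps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta = theta - eta * grad_J(theta)
        if np.linalg.norm(grad_J(theta)) <= eps:   # tolerance-based stopping rule
            break
    return theta

# Example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
print(gradient_descent(lambda th: 2 * th, [3.0, -2.0]))
```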

28 Gradient descent example: simple Linear regression. X = ℝ, Y = ℝ, training set {(x_i, y_i)}_{i=1}^N with x_i ∈ X, y_i ∈ Y. H = {mx + b : m ∈ ℝ, b ∈ ℝ}, θ = (m, b), h(x) = mx + b.

29 Gradient descent example: simple Linear regression. X = ℝ, Y = ℝ, training set {(x_i, y_i)}_{i=1}^N with x_i ∈ X, y_i ∈ Y. H = {mx + b : m ∈ ℝ, b ∈ ℝ}, θ = (m, b), h(x) = mx + b. The goal is to find good m and b values.

30 Gradient descent example: simple Linear regression. h(x) = mx + b, ℓ(h(x), y) = (y − h(x))². L_emp(h) = (1/N) Σ_{i=1}^N ℓ(h(x_i), y_i) = (1/N) Σ_{i=1}^N (y_i − h(x_i))² = (1/N) Σ_{i=1}^N (y_i − (mx_i + b))².

31 GD example: computing the gradient. For m: ∂/∂m L_emp(h) = ∂/∂m (1/N) Σ_{i=1}^N (y_i − (mx_i + b))² = (1/N) Σ_{i=1}^N ∂/∂m (y_i − (mx_i + b))² = (1/N) Σ_{i=1}^N 2 (y_i − (mx_i + b)) · (−x_i), so ∂J/∂m = −(2/N) Σ_{i=1}^N x_i (y_i − (mx_i + b)). Similarly for b: ∂J/∂b = −(2/N) Σ_{i=1}^N (y_i − (mx_i + b)).

32 GD example: computing the gradient. So the gradient vector is: ∇J(θ) = (∂J/∂m, ∂J/∂b) = (−(2/N) Σ_{i=1}^N x_i (y_i − (mx_i + b)), −(2/N) Σ_{i=1}^N (y_i − (mx_i + b))).

33 GD example: the complete algorithm. Input: learning rate η, tolerance parameter ε > 0. Initialization: pick θ_0 = (m_0, b_0) ∈ ℝ² arbitrarily. General step: set θ_{t+1} = θ_t − η ∇J(θ_t), i.e. m_{t+1} = m_t + η (2/N) Σ_{i=1}^N x_i (y_i − (m_t x_i + b_t)) and b_{t+1} = b_t + η (2/N) Σ_{i=1}^N (y_i − (m_t x_i + b_t)). If ‖∇J(θ_{t+1})‖ ≤ ε, then STOP and θ_{t+1} is the output.
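
A sketch of the full loop for this linear-regression example, assuming the 1/N-averaged quadratic loss and the gradients derived above (the variable names and the toy data are mine):

```python
import numpy as np

def linear_regression_gd(x, y, eta=0.01, eps=1e-6, max_iter=100_000):
    """Fit h(x) = m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(max_iter):
        residual = y - (m * x + b)
        grad_m = -(2.0 / n) * np.sum(x * residual)   # dJ/dm
        grad_b = -(2.0 / n) * np.sum(residual)       # dJ/db
        m -= eta * grad_m
        b -= eta * grad_b
        if np.hypot(grad_m, grad_b) <= eps:          # stop when the gradient is small
            break
    return m, b

# Toy data generated from y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.05 * rng.normal(size=100)
print(linear_regression_gd(x, y))
```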

34 GD example visualized 36

35 Variants of Gradient descent Differ in how much data we use to compute the gradients of the objective function. A trade-off between the accuracy of the update and the computation time per update. 38

36 Batch Gradient descent. Computes the gradients of the function w.r.t. the parameters θ for the entire training dataset: θ = θ − η ∇_θ J(θ; x^(1:N), y^(1:N)).

37 Batch Gradient descent Pros and Cons Pros Guaranteed to converge to global/local minimum. An unbiased estimate of gradients. Cons Possibly slow or impossible to compute. Some examples may be redundant. Converges to the minimum of the basin the parameters are placed in. 40

38 Stochastic Gradient descent (SGD). Computes the gradients of the function w.r.t. the parameters θ for a single training sample: θ = θ − η ∇_θ J(θ; x^(i), y^(i)).

39 Stochastic Gradient descent Pros and Cons Pros Much faster to compute. Potential to jump to better basins (and better local minima). Cons High variance that causes the objective to fluctuate heavily. 42

40 Mini-batch Gradient descent. Computes the gradients of the function w.r.t. the parameters θ for a mini-batch of n training samples: θ = θ − η ∇_θ J(θ; x^(i:i+n), y^(i:i+n)). n is usually 32 to a few hundred samples.
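
A sketch of one epoch of mini-batch SGD under these definitions, assuming X and Y are NumPy arrays and grad_J(theta, X_batch, Y_batch) returns the mini-batch gradient (the shuffling and the default batch size are my choices, not prescribed by the slide):

```python
import numpy as np

def minibatch_sgd_epoch(theta, grad_J, X, Y, eta=0.01, batch_size=32):
    """One pass over the data: update theta on each mini-batch of batch_size samples."""
    idx = np.random.permutation(len(X))               # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        theta = theta - eta * grad_J(theta, X[batch], Y[batch])
    return theta
```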

41 Mini-batch Gradient descent Pros and Cons. The best of both worlds: fast, exploratory, and allows for stable convergence. Makes use of highly optimized matrix-operation libraries and hardware. The method of choice for most Supervised Machine Learning scenarios.

42 Variants of Gradient descent - visualizations 45

43 It's all Mini-batch from here. The remainder of the presentation will focus on variants of the Mini-batch version. From here on, Gradient descent, SGD, and Gradient step all refer to the Mini-batch variant. We'll leave out the parameters x^(i:i+n), y^(i:i+n) for simplicity.

44 Challenges and limitations of the plain SGD Choosing a proper learning rate. Sharing the learning rate for all parameters. Optimization in the face of highly non-convex functions. 49

45 Questions?

46 Content
Background: Supervised ML theory and the importance of optimum finding; Gradient descent and its variants; Limitations of SGD
Earlier approaches/building blocks: Momentum; Nesterov accelerated gradient (NAG); AdaGrad (Adaptive Gradient); AdaDelta and RMSProp
Adam: Update rule; Bias correction; AdaMax
Post-Adam innovations: Improving Adam; Additional approaches; Shampoo

47 Novelties over the plain SGD: Momentum, Nesterov accelerated gradient (NAG), AdaGrad (Adaptive Gradient), AdaDelta and RMSProp. We will focus only on algorithms that are feasible to compute in practice for high-dimensional data sets (and will ignore second-order methods such as Newton's method).

48 56

49 Momentum (Qian, N. 1999). Plain SGD can make erratic updates on non-smooth loss functions. Consider an outlier example which throws off the learning process. We need to maintain some history of updates. Physics example: a moving ball acquires momentum, at which point it becomes less sensitive to the direct force (gradient).

50 Momentum (Qian, N. 1999). Add a fraction γ (usually about 0.9) of the update vector of the past time step to the current update vector. Faster convergence and reduced oscillations. v_t = γ v_{t−1} + η ∇_θ J(θ); θ = θ − v_t. (Compare plain SGD: θ = θ − η ∇_θ J(θ).) Notation: J(θ), objective function; θ ∈ ℝ^d, parameters; ∇_θ J(θ), gradient vector; η, learning rate.
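
A minimal per-step sketch of this update (the function signature is mine; γ = 0.9 follows the fraction suggested on the slide):

```python
def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    """v_t = gamma * v_{t-1} + eta * grad; then theta = theta - v_t."""
    v = gamma * v + eta * grad
    theta = theta - v
    return theta, v
```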

51 Momentum (Qian, N. 1999) (a) SGD without momentum (b) SGD with momentum 59

52 Nesterov accelerated gradient (Nesterov, Y. 1983). Momentum is usually pretty high once we get near our goal point. The algorithm has no idea when to slow down and therefore might miss the goal point. We would like our momentum to have a kind of foresight.

53 Momentum has no idea when to slow down 61

54 Nesterov accelerated gradient (Nesterov, Y. 1983). First make a jump based on our previous momentum, calculate the gradients, and then make a correction. Look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. the approximate future position of our parameters: v_t = γ v_{t−1} + η ∇_θ J(θ − γ v_{t−1}); θ = θ − v_t. Notation: J(θ), objective function; θ ∈ ℝ^d, parameters; ∇_θ J(θ), gradient vector; η, learning rate.
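
A sketch of the look-ahead step, assuming grad_J can evaluate the gradient at an arbitrary point, which is exactly what the approximate future position requires (the names are mine):

```python
def nesterov_step(theta, v, grad_J, eta=0.01, gamma=0.9):
    """Evaluate the gradient at the look-ahead point theta - gamma * v, then update."""
    lookahead = theta - gamma * v
    v = gamma * v + eta * grad_J(lookahead)
    theta = theta - v
    return theta, v
```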

55 Nesterov accelerated gradient (Nesterov, Y. 1983)

56 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011) Now we are able to adapt our updates to the slope. But updates are the same for all the parameters being updated. We would like to adapt our updates to each individual parameter depending on their importance. 64

57 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). Perform larger updates for infrequent parameters and smaller updates for frequent ones. Use a different learning rate for every parameter θ_i, at every time step t. Well-suited for dealing with sparse data. Eliminates the need to manually tune the learning rate (most just use 0.01).

58 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). g_{t,i} = ∇_θ J(θ_{t,i}); θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · g_{t,i}. Notation: J(θ), objective function; θ ∈ ℝ^d, parameters; ∇_θ J(θ), gradient vector; η, learning rate. (Compare plain SGD: θ = θ − η ∇_θ J(θ).)

59 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). g_{t,i} = ∇_θ J(θ_{t,i}); θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · g_{t,i}. Here g_{t,i} is the gradient w.r.t. the parameter θ_i at time step t; G_t is a diagonal matrix whose entry G_{t,ii} is the sum of squared gradients w.r.t. θ_i up to time step t; ε prevents division by zero (on the order of 1e-8).

60 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). g_{t,i} = ∇_θ J(θ_{t,i}); θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · g_{t,i}, or in vector form θ_{t+1} = θ_t − (η / √(G_t + ε)) ⊙ g_t. Here g_{t,i} is the gradient w.r.t. the parameter θ_i at time step t; G_t is a diagonal matrix whose entry G_{t,ii} is the sum of squared gradients w.r.t. θ_i up to time step t; ε prevents division by zero (on the order of 1e-8).
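
In code, the diagonal of G_t is just a running sum of squared gradients, so a per-step sketch looks like this (the names are mine; the ε default follows the slide's 1e-8):

```python
import numpy as np

def adagrad_step(theta, grad_sq_sum, grad, eta=0.01, eps=1e-8):
    """Accumulate squared gradients and scale each coordinate by 1 / sqrt(G_ii + eps)."""
    grad_sq_sum = grad_sq_sum + grad ** 2             # diagonal of G_t
    theta = theta - eta * grad / np.sqrt(grad_sq_sum + eps)
    return theta, grad_sq_sum
```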

61 Adagrad vs. Plain SGD. Adagrad: θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) · g_{t,i}. Plain SGD: θ_{t+1,i} = θ_{t,i} − η · g_{t,i}.

62 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). In Adagrad, the accumulated sum keeps growing. This causes the learning rate to shrink and become infinitesimally small, impeding convergence. We need an efficient way of reducing this aggressive, monotonically decreasing learning rate.

63 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). Recursively define a decaying average of those gradients. The running average at time step t depends only on the previous time step and the current gradient.

64 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²; θ_{t+1} = θ_t − (η / √(E[g²]_t + ε)) · g_t. Notation: J(θ), objective function; θ ∈ ℝ^d, parameters; g_t = ∇_θ J(θ), gradient vector; η, learning rate; γ, decay term. (Compare plain SGD: θ = θ − η ∇_θ J(θ).)

65 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²; θ_{t+1} = θ_t − (η / √(E[g²]_t + ε)) · g_t. Here E[g²]_t is the running average of squared gradients at time step t; ε prevents division by zero (on the order of 1e-8).
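
The only change from the Adagrad sketch above is replacing the running sum with a decaying average; a per-step sketch (γ = 0.9 is a common choice, not fixed by the slide):

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, eta=0.001, gamma=0.9, eps=1e-8):
    """E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2; divide the step by its root."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2
    theta = theta - eta * grad / np.sqrt(avg_sq + eps)
    return theta, avg_sq
```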

66 Adadelta (Zeiler, M. D. 2012)

67 Visualizations of the discussed algorithms 75

68 Questions?

69 Content
Background: Supervised ML theory and the importance of optimum finding; Gradient descent and its variants; Limitations of SGD
Earlier approaches/building blocks: Momentum; Nesterov accelerated gradient (NAG); AdaGrad (Adaptive Gradient); AdaDelta and RMSProp
Adam: Update rule; Bias correction; AdaMax
Post-Adam innovations: Improving Adam; Additional approaches; Shampoo

70 Adam (Kingma, D. P., & Ba, J. 2014). RMSProp allows for an adaptive per-parameter update, but the update itself is still done using the current, noisy gradient. We would like the gradient itself to be replaced by a similar exponentially decaying average of past gradients.

71 Adam Update rule. m_t = β₁ m_{t−1} + (1 − β₁) g_t; v_t = β₂ v_{t−1} + (1 − β₂) g_t²; θ_{t+1} = θ_t − (η / (√v_t + ε)) · m_t. Notation: J(θ), objective function; θ ∈ ℝ^d, parameters; g_t, gradient vector; η, learning rate. m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively. Recommended values in the paper are β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.

72 Adam Bias towards zero. As m_t and v_t are initialized as 0's, they are biased towards zero. Most significant in the initial steps. Most significant when β₁ and β₂ are close to 1. A correction is required.

73 Adam Bias correction. m̂_t = m_t / (1 − β₁^t); v̂_t = v_t / (1 − β₂^t); θ_{t+1} = θ_t − (η / (√v̂_t + ε)) · m̂_t. Notation: J(θ), objective function; θ ∈ ℝ^d, parameters; g_t, gradient vector; η, learning rate. Correct both moments to get the final update rule.
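
Putting the two moment estimates and the bias correction together gives the following per-step sketch (the hyper-parameter defaults follow the paper's recommended values; the function shape is mine):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: decaying averages of g and g^2, bias-corrected, per-parameter step."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```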

74 Adam vs. Adadelta. Adam: θ_{t+1} = θ_t − (η / (√v̂_t + ε)) · m̂_t, where m̂_t and v̂_t are the bias-corrected versions of m_t = β₁ m_{t−1} + (1 − β₁) g_t and v_t = β₂ v_{t−1} + (1 − β₂) g_t². Adadelta: θ_{t+1} = θ_t − (η / √(E[g²]_t + ε)) · g_t.

75 Adam - Performance 85

76 Adam - Performance 86

77 Adam - Performance 87

78 Questions?

79 Content
Background: Supervised ML theory and the importance of optimum finding; Gradient descent and its variants; Limitations of SGD
Earlier approaches/building blocks: Momentum; Nesterov accelerated gradient (NAG); AdaGrad (Adaptive Gradient); AdaDelta and RMSProp
Adam: Update rule; Bias correction; AdaMax
Post-Adam innovations: Improving Adam; Additional approaches; Shampoo

80 Improving Adam. Nadam: incorporating Nesterov momentum into Adam. AdamW: decoupling weight decay, θ_{t+1} = θ_t − (η / (√v̂_t + ε)) · m̂_t − η λ θ_t. AMSGrad: fixing the exponential moving average, v̂_t = max(v̂_{t−1}, v_t) and θ_{t+1} = θ_t − (η / (√v̂_t + ε)) · m_t, i.e. the maximum of past squared-gradient averages is used instead of their exponential moving average. Adam with warm restarts.
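
As an illustration of the AMSGrad fix, the only change relative to the Adam step sketched earlier is keeping a running element-wise maximum of v_t (a sketch under those assumptions; the names are mine):

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Like Adam, but divide by the maximum of all past second-moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                  # never let the denominator shrink
    theta = theta - eta * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```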

81 Additional approaches Snapshot ensembles Learning to optimize 91

82 Shampoo (Gupta, V., Koren, T., & Singer, Y. 2018) 92

83 Summary. A brief walkthrough of Supervised Machine Learning. An argument for the importance and relevance of Gradient methods. An overview of modern Gradient descent optimization algorithms. The contribution of Adam. Innovations that came after Adam.

84 94

85 Questions?

86 The End

87 links

88 References
Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv preprint, pages 1-14.
Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop.
Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, pages 1-13.
Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Docl.), 269.
Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1).
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.
