Adam: A Method for Stochastic Optimization
1 Adam: A Method for Stochastic Optimization Diederik P. Kingma, Jimmy Ba Presented by
2 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 2
3 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 3
6 Basic Supervised Machine Learning Terminology. Notation and explanation: $\mathcal{X}$: instance space; $\mathcal{Y}$: label space; $\mathcal{D}$: unknown probability distribution over $\mathcal{X}$; $f: \mathcal{X} \to \mathcal{Y}$: true mapping between instance and label spaces, unknown. 7
7 Examples of instance and label spaces ($\mathcal{X} \to \mathcal{Y}$): the space of all RGB images of some dimension → is there a cat in the image ({0,1}); the space of a stock's historical price sequences → the stock's next-day closing price ([0, ∞)); the space of all finite Chinese sentences → the set of all finite English sentences; the space of all information regarding two companies → the probability that a merger will be successful; the space of all MRI images → the probability, location and type of a tumor; the space of all finite-length voice sequences → the corresponding Amazon product being referred to.
8 Basic Supervised Machine Learning Terminology. Notation and explanation: $\mathcal{X}$: instance space; $\mathcal{Y}$: label space; $\mathcal{D}$: unknown probability distribution over $\mathcal{X}$; $f: \mathcal{X} \to \mathcal{Y}$: true mapping between instance and label spaces, unknown. 9
9 Basic Supervised Machine Learning Terminology. Notation and explanation: $\mathcal{X}$: instance space; $\mathcal{Y}$: label space; $\mathcal{D}$: unknown probability distribution over $\mathcal{X}$; $f: \mathcal{X} \to \mathcal{Y}$: true mapping, unknown; $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$: training set; $h: \mathcal{X} \to \mathcal{Y}$, $h \in \mathcal{H}$: hypothesis, the object we wish to learn. 10
10 Basic Supervised Machine Learning theory. The goal is to find a hypothesis $h$ that approximates $f$ as best as possible, but approximates in what sense? We need a way of evaluating a hypothesis' quality. 11
11 Basic Supervised Machine Learning theory. Let us define a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$. For example: zero-one loss $\mathbb{1}(h(x_i) \ne y_i)$; quadratic loss $(h(x_i) - y_i)^2$. Here we use $y_i = f(x_i)$. A measure of how bad the hypothesis did on a single sample. 12
12 Basic Supervised Machine Learning theory. Let us also define the Generalization error (aka Risk) of a hypothesis $h$: $L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h(x), f(x))]$. A measure of how bad the hypothesis did on the entire instance space. A good hypothesis is one with a low Risk value. 13
13 Basic Supervised Machine Learning theory. Let us also define the Generalization error (aka Risk) of a hypothesis $h$: $L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h(x), f(x))]$. A measure of how bad the hypothesis did on the entire instance space. A good hypothesis is one with a low Risk value. Unfortunately... 14
14 Basic Supervised Machine Learning theory. Unfortunately, the true Risk $L_{\mathcal{D}}(h)$ cannot be computed because the distribution $\mathcal{D}$ is unknown to the learning algorithm. 15
15 Basic Supervised Machine Learning theory. Unfortunately, the true Risk $L_{\mathcal{D}}(h)$ cannot be computed because the distribution $\mathcal{D}$ is unknown to the learning algorithm. We can, however, compute a proxy of the true Risk, called the Empirical Risk. 16
16 Basic Supervised Machine Learning theory. Let us define the Empirical Risk of a hypothesis $h$: $L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$. Recall: $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$ is the training set, and $L_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(h(x), f(x))]$. 17
17 Empirical Risk Minimization (ERM) strategy. After mountains of theory (PAC Learning, VC Theory, etc.), the following theorem is proven (this is very informal): the best strategy for a learning algorithm is to minimize the Empirical Risk. 18
18 Empirical Risk Minimization strategy (ERM). This means that the learning algorithm defined by the ERM principle boils down to solving the following optimization problem: find $\hat{h}$ such that $\hat{h} = \arg\min_{h \in \mathcal{H}} L_S(h) = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$. 19
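To make the ERM objective concrete, here is a minimal numpy sketch (not from the slides) that evaluates the empirical risk of a few candidate hypotheses under the quadratic loss and picks the best one; the synthetic data, slopes, and function names are illustrative assumptions.

```python
import numpy as np

def empirical_risk(h, xs, ys):
    """Average quadratic loss of hypothesis h over the training set."""
    return np.mean((h(xs) - ys) ** 2)

# Illustrative training set: y roughly equals 2x plus noise.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=100)
ys = 2.0 * xs + 0.1 * rng.normal(size=100)

# A tiny finite hypothesis class: h(x) = m * x for a few slopes m.
candidates = {m: (lambda x, m=m: m * x) for m in [0.0, 1.0, 2.0, 3.0]}

# ERM: pick the hypothesis with the smallest empirical risk.
best_m = min(candidates, key=lambda m: empirical_risk(candidates[m], xs, ys))
print("ERM choice: m =", best_m)
```

With a hypothesis class this small the minimization is a lookup; the rest of the presentation is about what to do when the class is parameterized by millions of real numbers.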
19 The hypothesis class can be very simple: when $\mathcal{X} \subseteq \mathbb{R}^3$ and $\mathcal{Y} = \{-1, 1\}$, $\mathcal{H}$ can be the class of all three-dimensional hyperplanes, $h(x) = \mathrm{sign}(w^\top x + w_0)$, $\theta = \{w_0, w_1, w_2, w_3\} \in \mathbb{R}^4$. 21
20 The hypothesis class can be very complex: recent years' Deep Learning architectures result in models with tens of millions of parameters, $\theta \in \mathbb{R}^{O(10^7)}$. 22
21 The bottom line. Machine Learning applications require us to find good local minima of extremely high dimensional, noisy, non-convex functions. $\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$
22 The bottom line. Machine Learning applications require us to find good local minima of extremely high dimensional, noisy, non-convex functions. We need a principled method for achieving this goal. $\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$
23 Introducing The Gradient method 25
24 Questions?
25 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 27
26 A couple of notes before we head on: I stick to the papers' notations. From here on we'll use $J(\theta)$, which is the same as $L_S(h)$. 28
27 Introducing The Gradient method. Input: learning rate $\eta$, tolerance parameter $\varepsilon > 0$. Initialization: pick $\theta_0 \in \mathbb{R}^d$ arbitrarily. General step: set $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$; if $\|\nabla J(\theta_{t+1})\| \le \varepsilon$, then STOP and $\theta_{t+1}$ is the output. 29
28 Gradient descent example: simple Linear regression. $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$, training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$. $\mathcal{H} = \{mx + b : m \in \mathbb{R}, b \in \mathbb{R}\}$, $\theta = (m, b)$, $h(x) = mx$
29 Gradient descent example: simple Linear regression. $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$, training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$. $\mathcal{H} = \{mx + b : m \in \mathbb{R}, b \in \mathbb{R}\}$, $\theta = (m, b)$, $h(x) = mx + b$. The goal is to find good $m$ and $b$ values. 31
30 Gradient descent example: simple Linear regression. $h(x) = mx + b$, $\ell(h(x), y) = (y - h(x))^2$. $L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i) = \frac{1}{n} \sum_{i=1}^{n} (y_i - h(x_i))^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - (m x_i + b))^2$. 32
31 GD example: computing the gradient. For $m$: $\frac{\partial}{\partial m} L_S(h) = \frac{\partial}{\partial m} \frac{1}{n} \sum_{i=1}^{n} (y_i - (m x_i + b))^2 = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (m x_i + b))$. Similarly for $b$: $\frac{\partial}{\partial b} L_S(h) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - (m x_i + b))$. 33
32 GD example: computing the gradient. So the gradient vector is: $\nabla L_S(\theta) = \left( \frac{\partial L_S}{\partial m}, \frac{\partial L_S}{\partial b} \right) = \left( -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (m x_i + b)), \; -\frac{2}{n} \sum_{i=1}^{n} (y_i - (m x_i + b)) \right)$. 34
33 GD example: the complete algorithm. Input: learning rate $\eta$, tolerance parameter $\varepsilon > 0$. Initialization: pick $\theta_0 = (m_0, b_0) \in \mathbb{R}^2$ arbitrarily. General step: set $\theta_{t+1} = \theta_t - \eta \nabla L_S(\theta_t)$, i.e. $m_{t+1} = m_t + \eta \frac{2}{n} \sum_{i=1}^{n} x_i (y_i - (m_t x_i + b_t))$ and $b_{t+1} = b_t + \eta \frac{2}{n} \sum_{i=1}^{n} (y_i - (m_t x_i + b_t))$; if $\|\nabla L_S(\theta_{t+1})\| \le \varepsilon$, then STOP and $\theta_{t+1}$ is the output. 35
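A minimal numpy sketch of this algorithm follows; the synthetic data, learning rate, and tolerance are illustrative assumptions rather than values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=200)   # data generated with m = 3, b = 0.5

eta, eps = 0.1, 1e-6          # learning rate and tolerance (illustrative)
m, b = 0.0, 0.0               # theta_0 = (m, b), arbitrary initialization
n = len(x)

for t in range(10_000):
    residual = y - (m * x + b)
    grad_m = -(2.0 / n) * np.sum(x * residual)   # dL/dm
    grad_b = -(2.0 / n) * np.sum(residual)       # dL/db
    m -= eta * grad_m                            # theta_{t+1} = theta_t - eta * grad
    b -= eta * grad_b
    if np.hypot(grad_m, grad_b) <= eps:          # stop once the gradient norm is small
        break

print(f"m = {m:.3f}, b = {b:.3f}")
```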
34 GD example visualized 36
35 Variants of Gradient descent Differ in how much data we use to compute the gradients of the objective function. A trade-off between the accuracy of the update and the computation time per update. 38
36 Batch Gradient descent. Computes the gradients of the function w.r.t. the parameters $\theta$ for the entire training dataset: $\theta = \theta - \eta \nabla_\theta J(\theta; x^{(1:n)}, y^{(1:n)})$. 39
37 Batch Gradient descent Pros and Cons. Pros: guaranteed to converge to the global minimum for convex surfaces and to a local minimum for non-convex ones; an unbiased estimate of gradients. Cons: possibly slow or impossible to compute; some examples may be redundant; converges to the minimum of the basin the parameters are placed in. 40
38 Stochastic Gradient descent (SGD). Computes the gradients of the function w.r.t. the parameters $\theta$ for a single training sample: $\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$. 41
39 Stochastic Gradient descent Pros and Cons Pros Much faster to compute. Potential to jump to better basins (and better local minima). Cons High variance that causes the objective to fluctuate heavily. 42
40 Mini-batch Gradient descent. Computes the gradients of the function w.r.t. the parameters $\theta$ for a mini-batch of $k$ training samples: $\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i:i+k)}, y^{(i:i+k)})$. $k$ is usually 32 to
41 Mini-batch Gradient descent Pros and Cons: the best of both worlds; fast, exploratory, and allows for stable convergence. Makes use of highly optimized matrix libraries and hardware. The method of choice for most Supervised Machine Learning scenarios. 44
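As a sketch of the mini-batch loop (the linear model, synthetic data, and batch size of 32 are assumptions made here for illustration, not values given on the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eta, batch_size = 0.05, 32

for epoch in range(20):
    perm = rng.permutation(len(X))               # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2.0 / len(xb) * xb.T @ (xb @ w - yb)   # gradient on this mini-batch only
        w -= eta * grad

print(np.round(w, 2))
```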
42 Variants of Gradient descent - visualizations 45
43 It's all Mini-batch from here. The remainder of the presentation will focus on variants of the Mini-batch version. From here on, Gradient descent, SGD, and Gradient step all refer to the Mini-batch variant. We'll leave out the parameters $x^{(i:i+k)}, y^{(i:i+k)}$ for simplicity. 48
44 Challenges and limitations of the plain SGD Choosing a proper learning rate. Sharing the learning rate for all parameters. Optimization in the face of highly non-convex functions. 49
45 Questions?
46 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 54
47 Novelties over the plain SGD: Momentum, Nesterov accelerated gradient (NAG), AdaGrad (Adaptive Gradient), AdaDelta and RMSProp. We will focus only on algorithms that are feasible to compute in practice for high dimensional data sets (and will ignore second-order methods such as Newton's method). 55
49 Momentum (Qian, N. 1999) Plain SGD can make erratic updates on non-smooth loss functions Consider an outlier example which throws off the learning process Need to maintain some history of updates. Physics example: A moving ball acquires momentum, at which point it becomes less sensitive to the direct force (gradient). 57
50 Momentum (Qian, N. 1999). Add a fraction $\gamma$ (usually about 0.9) of the update vector of the past time step to the current update vector. Faster convergence and reduced oscillations. $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$, $\theta = \theta - v_t$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $\nabla_\theta J(\theta)$ the gradient vector, and $\eta$ the learning rate. Compare with plain SGD: $\theta = \theta - \eta \nabla_\theta J(\theta)$. 58
51 Momentum (Qian, N. 1999) (a) SGD without momentum (b) SGD with momentum 59
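A minimal sketch of the momentum update on a toy least-squares objective; $\gamma = 0.9$ matches the slide, while the data, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
v = np.zeros_like(theta)      # update vector of the past time step
gamma, eta = 0.9, 0.01

for step in range(2000):
    grad = 2.0 / len(X) * X.T @ (X @ theta - y)   # gradient of J(theta)
    v = gamma * v + eta * grad                    # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v                             # theta = theta - v_t

print(np.round(theta, 2))
```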
52 Nesterov accelerated gradient (Nesterov, Y. 1983). Momentum is usually pretty high once we get near our goal point. The algorithm has no idea when to slow down and therefore might miss the goal point. We would like our momentum to have a kind of foresight. 60
53 Momentum has no idea when to slow down 61
54 Nesterov accelerated gradient (Nesterov, Y. 1983). First make a jump based on our previous momentum, calculate the gradients and then make a correction. Look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. the approximate future position of our parameters: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$, $\theta = \theta - v_t$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $\nabla_\theta J$ the gradient vector, and $\eta$ the learning rate. 62
55 Nesterov accelerated gradient (Nesterov, Y. 1983) 63
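A sketch of the look-ahead step, using the same kind of illustrative objective as before (this is a minimal sketch under assumed data and hyperparameters, not the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

def grad(theta):
    """Gradient of the mean-squared-error objective at theta."""
    return 2.0 / len(X) * X.T @ (X @ theta - y)

theta = np.zeros(3)
v = np.zeros_like(theta)
gamma, eta = 0.9, 0.01

for step in range(2000):
    lookahead = theta - gamma * v          # approximate future position of the parameters
    v = gamma * v + eta * grad(lookahead)  # gradient evaluated at the look-ahead point
    theta = theta - v

print(np.round(theta, 2))
```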
56 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011) Now we are able to adapt our updates to the slope. But updates are the same for all the parameters being updated. We would like to adapt our updates to each individual parameter depending on their importance. 64
57 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). Perform larger updates for infrequent parameters and smaller updates for frequent ones. Use a different learning rate for every parameter $\theta_i$, at every time step $t$. Well-suited for dealing with sparse data. Eliminates the need to manually tune the learning rate (most just use 0.01). 65
58 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). $g_{t,i} = \nabla_\theta J(\theta_{t,i})$, $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i}$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $\nabla_\theta J$ the gradient vector, and $\eta$ the learning rate. Compare with plain SGD: $\theta = \theta - \eta \nabla_\theta J(\theta)$. 66
59 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). $g_{t,i} = \nabla_\theta J(\theta_{t,i})$, $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i}$. $g_{t,i}$: the gradient w.r.t. the parameter $\theta_i$ at time step $t$. $G_t$: a diagonal matrix where $G_{t,ii}$ is the sum of squared gradients w.r.t. $\theta_i$ up to time step $t$. $\epsilon$: prevents division by zero (on the order of 1e-8). 67
60 Adagrad (Duchi, J., Hazan, E., & Singer, Y. 2011). $g_{t,i} = \nabla_\theta J(\theta_{t,i})$, $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i}$; in vectorized form, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$. $g_{t,i}$: the gradient w.r.t. the parameter $\theta_i$ at time step $t$. $G_t$: a diagonal matrix where $G_{t,ii}$ is the sum of squared gradients w.r.t. $\theta_i$ up to time step $t$. $\epsilon$: prevents division by zero (on the order of 1e-8). 68
61 Adagrad vs. Plain SGD. Adagrad: $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} g_{t,i}$. Plain SGD: $\theta_{t+1,i} = \theta_{t,i} - \eta\, g_{t,i}$. 69
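A per-parameter sketch of the accumulator and the scaled step; the learning rate of 0.01 follows the slide's remark, while the toy objective and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
G = np.zeros(3)               # running sum of squared gradients, one entry per parameter
eta, eps = 0.01, 1e-8

for step in range(5000):
    g = 2.0 / len(X) * X.T @ (X @ theta - y)
    G += g ** 2                               # accumulate squared gradients
    theta -= eta / np.sqrt(G + eps) * g       # per-parameter scaled step

print(np.round(theta, 2))
```

Note how G only grows over the run; the next slides show why that eventually stalls the updates.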
62 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). In Adagrad, the accumulated sum keeps growing. This causes the learning rate to shrink and become infinitesimally small, impeding convergence. We need an efficient way of reducing this aggressive, monotonically decreasing learning rate. 70
63 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). Recursively define a decaying average of past squared gradients. The running average at time step $t$ depends only on the previous time step and the current gradient. 71
64 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). $E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $g_t$ the gradient vector, $\eta$ the learning rate, and $\gamma$ the momentum (decay) term. Compare with plain SGD: $\theta = \theta - \eta \nabla_\theta J(\theta)$. 72
65 Adadelta (Zeiler, M. D. 2012) and RMSProp (Hinton). $E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$. $E[g^2]_t$: the running average of squared gradients at time step $t$. $\epsilon$: prevents division by zero (on the order of 1e-8). 73
66 Adadelta (Zeiler, M. D. 2012) 74
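A sketch of the decaying-average (RMSProp-style) version; $\gamma = 0.9$ is a common default assumed here, and the rest of the setup is the same illustrative toy objective as above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
Eg2 = np.zeros(3)             # decaying average of squared gradients, E[g^2]_t
gamma, eta, eps = 0.9, 0.01, 1e-8

for step in range(5000):
    g = 2.0 / len(X) * X.T @ (X @ theta - y)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2      # forget old gradients instead of summing them
    theta -= eta / np.sqrt(Eg2 + eps) * g         # step scaled by the running average

print(np.round(theta, 2))
```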
67 Visualizations of the discussed algorithms 75
68 Questions?
69 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 77
70 Adam (Kingma, D. P., & Ba, J. 2014). RMSProp allows for an adaptive per-parameter update, but the update itself is still done using the current, noisy gradient. We would like the gradient itself to be replaced by a similar exponentially decaying average of past gradients. 78
71 Adam Update rule. $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} m_t$. Here $J(\theta)$ is the objective function, $\theta \in \mathbb{R}^d$ the parameters, $g_t$ the gradient vector, and $\eta$ the learning rate. $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively. Recommended values in the paper are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
72 Adam Bias towards zero. As $m_t$ and $v_t$ are initialized as 0's, they are biased towards zero. Most significant in the initial steps. Most significant when $\beta_1$ and $\beta_2$ are close to 1. A correction is required. 80
73 Adam Bias correction. $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$. Correct both moments to get the final update rule. 81
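Putting the moment estimates and the bias correction together, here is a minimal sketch of the full update; $\beta_1$, $\beta_2$, and $\epsilon$ follow the paper's recommended values, while the learning rate and the toy objective are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
m = np.zeros(3)               # first-moment estimate (mean of gradients)
v = np.zeros(3)               # second-moment estimate (uncentered variance)
beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8

for t in range(1, 5001):
    g = 2.0 / len(X) * X.T @ (X @ theta - y)
    m = beta1 * m + (1 - beta1) * g               # decaying mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2          # decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= eta / (np.sqrt(v_hat) + eps) * m_hat

print(np.round(theta, 2))
```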
74 Adam vs. Adadelta. Adam: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$, with $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ and $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$: the step uses a decaying average of past gradients. Adadelta: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$: the step still uses the raw current gradient $g_t$. 82
75 Adam - Performance 85
76 Adam - Performance 86
77 Adam - Performance 87
78 Questions?
79 Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations of SGD Earlier approaches/building blocks Momentum Nesterov accelerated gradient (NAG) AdaGrad (Adaptive Gradient) AdaDelta and RMSProp Adam Update rule Bias correction AdaMax Post Adam innovations Improving Adam Additional approaches Shampoo 89
80 Improving Adam. NAdam: incorporating Nesterov into Adam. AdamW: decoupling weight decay, $\theta_{t+1} = \theta_t - \eta \left( \frac{1}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t + \lambda \theta_t \right)$. AMSGrad: fixing the exponential moving average, $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} m_t$; the maximum of past squared gradients is used instead of their exponential moving average. Adam with warm restarts. 90
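The AMSGrad change relative to Adam is essentially one line: keep the running maximum of the second-moment estimate so the effective step size can never grow back. A minimal sketch under the same illustrative setup as before (bias correction omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

theta = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
v_max = np.zeros(3)           # running maximum of the second-moment estimate
beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8

for t in range(1, 5001):
    g = 2.0 / len(X) * X.T @ (X @ theta - y)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)                  # never let the denominator shrink
    theta -= eta / (np.sqrt(v_max) + eps) * m

print(np.round(theta, 2))
```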
81 Additional approaches Snapshot ensembles Learning to optimize 91
82 Shampoo (Gupta, V., Koren, T., & Singer, Y. 2018) 92
83 Summary. A brief walkthrough of Supervised Machine Learning. A conviction of the importance and relevance of Gradient methods. An overview of modern Gradient descent optimization algorithms. The contribution of Adam. Innovations subsequent to Adam. 93
85 Questions?
86 The End
87 links
88 References
Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv, pages 1-14.
Timothy Dozat. Incorporating Nesterov Momentum into Adam. ICLR Workshop.
Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, pages 1-13.
Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269.
Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks: the official journal of the International Neural Network Society, 12(1).
Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul).