Deep Learning II: Momentum & Adaptive Step Size


Deep Learning II: Momentum & Adaptive Step Size CS 760: Machine Learning Spring 2018 Mark Craven and David Page www.biostat.wisc.edu/~craven/cs760

Goals for the Lecture You should understand the following concepts: momentum, Nesterov momentum, adaptive step size, AdaGrad, RMSProp, Adam. Thanks to Yujia Bao!

Background: Mini-batch gradient descent Procedure: pick a mini-batch of examples from the data; for each example, run forward propagation to get the output and backward propagation to get the gradient; then update the weights using the average gradient over the mini-batch. Common mini-batch sizes: 32, 64, 128. Note: forward and backward propagation can be done for many examples at once using matrix multiplication, and this is fast on GPUs!
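To make the procedure concrete, here is a minimal NumPy sketch of the mini-batch loop, using a simple least-squares model purely for illustration; the synthetic data, the model, and the hyper-parameters (learning rate 0.1, batch size 32) are assumptions, not taken from the lecture.

```python
import numpy as np

# Minimal sketch of mini-batch gradient descent on a least-squares problem.
# The synthetic data, model, learning rate, and batch size are illustrative
# assumptions, not part of the lecture.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))            # 1024 examples, 10 features
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1024)

w = np.zeros(10)
eta, batch_size = 0.1, 32                  # common mini-batch sizes: 32, 64, 128

for epoch in range(20):
    perm = rng.permutation(len(X))         # shuffle, then sweep over mini-batches
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        pred = Xb @ w                          # "forward prop" for the whole batch
        grad = Xb.T @ (pred - yb) / len(idx)   # "backward prop": average gradient
        w = w - eta * grad                     # update with the average gradient
```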

Mini-batch gradient descent Recall the forward prop for an MLP: z^{(k+1)} = f^{(k+1)}(W^{(k)} z^{(k)} + b^{(k)}), where f^{(k+1)} is the activation function for the (k+1)st layer and W^{(k)}[i][j] is the weight from z^{(k)}[j] to z^{(k+1)}[i]. If we have n examples, we can stack them as columns, Z^{(k)} = [z^{(k)}_1, ..., z^{(k)}_n] and B^{(k)} = [b^{(k)}, ..., b^{(k)}], and express the forward prop as Z^{(k+1)} = f^{(k+1)}(W^{(k)} Z^{(k)} + B^{(k)}).
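The stacked form is what makes the GPU-friendly matrix-multiply implementation possible. Below is a small NumPy sketch of this batched forward pass; the layer sizes and the choice of tanh/identity activations are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the stacked forward pass Z^{(k+1)} = f^{(k+1)}(W^{(k)} Z^{(k)} + B^{(k)}).
# The layer sizes and the tanh/identity activations are illustrative assumptions.
rng = np.random.default_rng(0)
n = 5                                      # number of examples (columns of Z)
sizes = [4, 8, 3]                          # input dim, hidden dim, output dim
Ws = [rng.normal(size=(sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
bs = [np.zeros((sizes[k + 1], 1)) for k in range(len(sizes) - 1)]
fs = [np.tanh, lambda a: a]                # activation for each layer

Z = rng.normal(size=(sizes[0], n))         # each column is one example z^{(0)}
for W, b, f in zip(Ws, bs, fs):
    # Broadcasting b across columns plays the role of B^{(k)} = [b^{(k)}, ..., b^{(k)}].
    Z = f(W @ Z + b)

print(Z.shape)                             # (3, 5): one output column per example
```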

An example

Weight Update Methods: Steepest Descent (review), Momentum (review), Nesterov Accelerated Gradient, AdaGrad [Duchi et al., 2011], RMSProp [Tieleman & Hinton, 2012], Adam [Kingma & Ba, 2015]. The last three use an element-wise adaptive learning rate.

Steepest Descent Algorithm: Step 1. w^{(t+1)} = w^{(t)} − η ∇L(w^{(t)}). Very simple, just one step. Convergence is poor if the loss surface is much steeper in one dimension than in another. Let's see an example!

Steepest Descent Objective: L(w) = w_1^2 + 25 w_2^2. Gradient: ∇L(w) = (2w_1, 50w_2). The gradient is small in the w_1 direction and large in the w_2 direction!

Steepest Descent Zigzag behavior! It moves very slowly along the flat direction and oscillates along the steep direction.
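A minimal NumPy sketch of steepest descent on this objective is shown below; the starting point and the step size η = 0.035 are assumptions chosen so the zigzag is easy to see in the printed iterates.

```python
import numpy as np

# Steepest descent on the lecture's example L(w) = w1^2 + 25*w2^2,
# with gradient (2*w1, 50*w2).  Starting point and step size are assumptions
# chosen to make the zigzag visible.
def grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])

w = np.array([5.0, 1.0])
eta = 0.035                                # near the stability limit 2/50 of the steep axis
for t in range(10):
    w = w - eta * grad(w)
    print(t, w)                            # w2 flips sign every step; w1 shrinks slowly
```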

Trick 7: Momentum Algorithm: Step 1. v^{(t+1)} = μ v^{(t)} − η ∇L(w^{(t)}). Step 2. w^{(t+1)} = w^{(t)} + v^{(t+1)}. Interpret v as velocity and 1 − μ as friction; v stores a moving average of the gradients. Picture a ball rolling down the surface of the loss function. Initialize v^{(0)} to zero. Common settings for μ range from 0.5 to 0.99.

Trick 7: Momentum Momentum accelerates convergence along directions in which the gradient has kept the same sign over the past few iterations. If μ is large, momentum will overshoot the target, but overall convergence is still faster than steepest descent in practice. As μ gets smaller, momentum behaves more like steepest descent.
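Here is a corresponding NumPy sketch of the momentum update (Steps 1 and 2 above) on the same quadratic; the starting point, η, and μ are illustrative assumptions.

```python
import numpy as np

# Momentum (Steps 1 and 2 above) on the same quadratic L(w) = w1^2 + 25*w2^2.
# Starting point, eta, and mu are illustrative assumptions.
def grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])

w = np.array([5.0, 1.0])
v = np.zeros(2)                            # initialize v^(0) to zero
eta, mu = 0.01, 0.9                        # mu is commonly between 0.5 and 0.99
for t in range(100):
    v = mu * v - eta * grad(w)             # Step 1: update the velocity
    w = w + v                              # Step 2: move by the velocity
print(w)    # far closer to (0, 0) than steepest descent with the same eta would get
```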

Nesterov Accelerated Gradient [Diagram comparing the two update rules. Momentum: starting from w^{(t)}, the actual update to w^{(t+1)} is the sum of the speed direction μ v^{(t)} and the negative gradient direction −η ∇L(w^{(t)}). Nesterov Accelerated Gradient: starting from w^{(t)}, the actual update is the sum of the speed direction μ v^{(t)} and the negative gradient direction −η ∇L(w^{(t)} + μ v^{(t)}), i.e., the gradient is evaluated at the look-ahead point.]

An example Suppose w^{(t)} is already optimal, but v^{(t)} ≠ 0. [Diagram: With Momentum, the negative gradient direction is 0, so the actual update is just the speed direction μ v^{(t)}, which pushes w^{(t+1)} away from the optimum. With Nesterov Accelerated Gradient, the negative gradient direction is −η ∇L(w^{(t)} + μ v^{(t)}), which points back toward the optimum and partially cancels the speed direction.] Look ahead! The actual update is more reliable than Momentum's.

Nesterov Accelerated Gradient Algorithm: Step 1. v^{(t+1)} = μ v^{(t)} − η ∇L(w^{(t)} + μ v^{(t)}). Step 2. w^{(t+1)} = w^{(t)} + v^{(t+1)}. Nesterov Accelerated Gradient performs better than momentum in practice [Sutskever et al., 2013].
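A NumPy sketch of Nesterov Accelerated Gradient on the same quadratic follows; the only change from the momentum sketch is where the gradient is evaluated. The starting point, η, and μ are again illustrative assumptions.

```python
import numpy as np

# Nesterov Accelerated Gradient on the same quadratic: identical to the momentum
# sketch except the gradient is evaluated at the look-ahead point w + mu*v.
# Starting point, eta, and mu are illustrative assumptions.
def grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])

w = np.array([5.0, 1.0])
v = np.zeros(2)
eta, mu = 0.01, 0.9
for t in range(100):
    v = mu * v - eta * grad(w + mu * v)    # Step 1: gradient at the look-ahead point
    w = w + v                              # Step 2
print(w)
```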

Trick 8: Adaptive Learning Rate

Some Notes on our Notation in this Lecture For a scalar η and vector v, ηv is a vector. Likewise η/√v is a vector (the square root and the division are taken element-wise), just as η^{-1} v is a vector. By a product between two vectors, we mean element-wise multiplication rather than a dot product.
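As a quick concrete check of this convention, the following tiny NumPy snippet (with arbitrary numbers) shows each case producing a vector.

```python
import numpy as np

# Element-wise convention: each expression below is a vector, not a scalar.
eta = 0.1
v = np.array([4.0, 25.0])
print(eta * v)            # scalar times vector              -> [0.4  2.5 ]
print(eta / np.sqrt(v))   # scalar over element-wise sqrt    -> [0.05 0.02]
print(v * v)              # vector "product" is element-wise -> [ 16. 625.]
```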

AdaGrad (Adaptive Gradient) Algorithm: Step 1. G^{(t+1)} = G^{(t)} + (∇L(w^{(t)}))^2, where G is the running sum of squared gradients. Step 2. w^{(t+1)} = w^{(t)} − (η / (√(G^{(t+1)}) + 10^{-8})) ∇L(w^{(t)}). Scale the learning rate by 1/√G and follow the negative gradient direction. A common setting for η is 0.01. Each weight gets its own learning rate (each w has its own G): the learning rate is effectively increased along flat directions and decreased along steep directions.

AdaGrad (Adaptive Gradient) Consider the previous example: L(w) = w_1^2 + 25 w_2^2, with gradient ∇L(w) = (2w_1, 50w_2). Start at w^{(0)} = (1, 1). Then w^{(1)} = (1, 1) − ( η·2 / (√(2^2) + 10^{-8}), η·50 / (√(50^2) + 10^{-8}) ) ≈ (1 − η, 1 − η). AdaGrad will follow the diagonal direction!

AdaGrad (Adaptive Gradient) Algorithm: Step 1. G^{(t+1)} = G^{(t)} + (∇L(w^{(t)}))^2. Step 2. w^{(t+1)} = w^{(t)} − (η / (√(G^{(t+1)}) + 10^{-8})) ∇L(w^{(t)}). Why so slow? Because G only ever grows, the effective learning rate decreases monotonically as t increases, and eventually learning stops!
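The following NumPy sketch runs AdaGrad on the same quadratic; it shows the near-diagonal first step and the slowdown as G accumulates. The value of η and the iteration count are assumptions.

```python
import numpy as np

# AdaGrad on the same quadratic L(w) = w1^2 + 25*w2^2.  The first step moves
# almost exactly along the diagonal (both coordinates shrink by about eta),
# and progress slows as the accumulated G keeps growing.  eta and the number
# of iterations are assumptions.
def grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])

w = np.array([1.0, 1.0])
G = np.zeros(2)                            # per-weight sum of squared gradients
eta, eps = 0.01, 1e-8
for t in range(100):
    g = grad(w)
    G = G + g ** 2                         # Step 1: accumulate squared gradients
    w = w - eta / (np.sqrt(G) + eps) * g   # Step 2: per-weight scaled step
    if t == 0:
        print("after step 1:", w)          # roughly (1 - eta, 1 - eta)
print("after step 100:", w)                # still far from (0, 0): learning has slowed
```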

RMSProp (Root Mean Square Propagation) Algorithm: Step 1. G^{(t+1)} = γ G^{(t)} + (1 − γ)(∇L(w^{(t)}))^2, where G is an exponential moving average of the squared gradients. Step 2. w^{(t+1)} = w^{(t)} − (η / (√(G^{(t+1)}) + 10^{-8})) ∇L(w^{(t)}). Scale the learning rate by 1/√G and follow the negative gradient direction. Default settings: γ = 0.9, η = 0.001.
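A NumPy sketch of RMSProp on the same quadratic is given below, using the default γ and η quoted above; the starting point and iteration count are assumptions.

```python
import numpy as np

# RMSProp on the same quadratic: G is an exponential moving average of squared
# gradients instead of a running sum, so the effective step size does not decay
# to zero.  gamma and eta are the defaults quoted above; the starting point and
# iteration count are assumptions.
def grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])

w = np.array([1.0, 1.0])
G = np.zeros(2)
eta, gamma, eps = 0.001, 0.9, 1e-8
for t in range(2000):
    g = grad(w)
    G = gamma * G + (1.0 - gamma) * g ** 2   # Step 1: exponential moving average
    w = w - eta / (np.sqrt(G) + eps) * g     # Step 2
print(w)                                     # ends up close to the optimum (0, 0)
```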

Another example (saddle point) Now consider minimizing L(w) = w_1^3 + w_2^2. The problem is unbounded below and (0, 0) is a saddle point. At (0.01, 1) we have ∇L = (0.0003, 2). How will these methods behave if we start at this point?

Adam (Adaptive Moment Estimation) Algorithm [Simplified Version]: Step 1. m^{(t+1)} = β_1 m^{(t)} + (1 − β_1) ∇L(w^{(t)}), where m is the exponential moving average of the gradients (playing a similar role to the velocity in the momentum method). Step 2. v^{(t+1)} = β_2 v^{(t)} + (1 − β_2)(∇L(w^{(t)}))^2, where v is the exponential moving average of the squared gradients. Step 3. w^{(t+1)} = w^{(t)} − η m^{(t+1)} / (√(v^{(t+1)}) + 10^{-8}). Default settings: β_1 = 0.9, β_2 = 0.999, η = 0.001.

Adam (Adaptive Moment Estimation) Algorithm [Complete Version]: Step 1. m^{(t+1)} = β_1 m^{(t)} + (1 − β_1) ∇L(w^{(t)}). Step 2. v^{(t+1)} = β_2 v^{(t)} + (1 − β_2)(∇L(w^{(t)}))^2. Step 3. m̂^{(t+1)} = m^{(t+1)} / (1 − β_1^{t+1}). Step 4. v̂^{(t+1)} = v^{(t+1)} / (1 − β_2^{t+1}). Steps 3 and 4 are the bias correction (needed since we set m^{(0)} and v^{(0)} to 0). Step 5. w^{(t+1)} = w^{(t)} − η m̂^{(t+1)} / (√(v̂^{(t+1)}) + 10^{-8}).
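A NumPy sketch of the complete Adam update (including bias correction) on the same quadratic follows, using the default β_1, β_2, and η quoted earlier; the starting point and iteration count are assumptions.

```python
import numpy as np

# Adam (complete version with bias correction) on the same quadratic.
# beta1, beta2, and eta are the defaults quoted above; the starting point and
# iteration count are assumptions.
def grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])

w = np.array([1.0, 1.0])
m = np.zeros(2)                                   # EMA of gradients
v = np.zeros(2)                                   # EMA of squared gradients
beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
for t in range(1, 2001):
    g = grad(w)
    m = beta1 * m + (1.0 - beta1) * g             # Step 1
    v = beta2 * v + (1.0 - beta2) * g ** 2        # Step 2
    m_hat = m / (1.0 - beta1 ** t)                # Step 3: bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # Step 4: bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # Step 5
print(w)                                          # approaches the optimum (0, 0)
```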

Adam (Adaptive Moment Estimation) [Figure: training curves for a feedforward network; image credit: Kingma & Ba, 2015. Network configuration: 28x28 input layer, 2 hidden layers with 1000 ReLUs each, 10-way SoftMax output, cross-entropy loss.] In practice, Adam is the default choice to use. [Fei-Fei Li, Andrej Karpathy, Justin Johnson]