Overview of gradient descent optimization algorithms HYUNG IL KOO Based on http://sebastianruder.com/optimizing-gradient-descent/

Problem Statement Machine Learning Optimization Problem Training samples: (x^(i), y^(i)) Cost function: J(θ; X, Y) = Σ_i d(f(x^(i); θ), y^(i)) Goal: θ* = arg min_θ J(θ; X, Y)
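
Below is a minimal sketch of this cost function in code. The linear model f and the squared-error distance d are illustrative assumptions, not something the slides fix.

```python
# Sketch of J(theta; X, Y) = sum_i d(f(x^(i); theta), y^(i)),
# assuming a linear model f and squared-error distance d (illustrative choices).
import numpy as np

def f(X, theta):
    return X @ theta                          # hypothetical linear model

def J(theta, X, Y):
    return np.sum((f(X, theta) - Y) ** 2)     # sum of per-example squared errors
```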

Optimization method Gradient Descent The most common way to optimize neural networks. Every deep learning library contains implementations of various gradient descent algorithms. Gradient descent minimizes an objective function J(θ), parameterized by a model's parameters θ ∈ R^d, by updating the parameters in the opposite direction of the gradient ∇_θ J(θ) of the objective function with respect to the parameters.

CONTENTS

Gradient descent variants Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Challenges

Gradient descent optimization algorithms Momentum Adaptive Gradient Visualization Which optimizer to choose? Additional strategies for optimizing SGD Shuffling and Curriculum Learning Batch normalization

GRADIENT DESCENT VARIANTS

Batch gradient descent Computes the gradient of the cost function w.r.t. θ for the entire training dataset: θ_new = θ_old − η ∇_θ J(θ; X, Y) Properties: Very slow; Intractable for datasets that don't fit in memory; No online learning
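
As a rough illustration of the update above, here is a minimal numpy sketch on a toy least-squares problem; the data, cost function, and learning rate are all assumptions for the example.

```python
# Batch gradient descent: one update per pass over the ENTIRE dataset.
# Toy least-squares problem; data and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 training samples, 3 features
Y = X @ np.array([1.0, -2.0, 0.5])         # targets from a known linear model

theta = np.zeros(3)
eta = 0.1                                  # learning rate

for epoch in range(100):
    grad = X.T @ (X @ theta - Y) / len(X)  # gradient of the cost over all samples
    theta = theta - eta * grad             # theta_new = theta_old - eta * grad
```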

Stochastic Gradient Descent (SGD) Performs a parameter update for each training example x^(i) and label y^(i): θ_new = θ_old − η ∇_θ J(θ; x^(i), y^(i)) Properties: Faster; Online learning; Heavy fluctuation; Capability to jump to new (potentially better) local minima; Complicated convergence (overshooting)
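
For contrast, a minimal sketch of SGD on the same toy problem (again, the data and learning rate are illustrative assumptions): one update per training example, with the examples shuffled every epoch.

```python
# Stochastic gradient descent: one update per training example.
# Toy least-squares problem; data and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)
eta = 0.01

for epoch in range(20):
    for i in rng.permutation(len(X)):          # visit examples in random order
        x_i, y_i = X[i], Y[i]
        grad = (x_i @ theta - y_i) * x_i       # gradient for a single example
        theta = theta - eta * grad
```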

SGD fluctuation

Batch Gradient vs SGD Batch gradient descent converges to the minimum of the basin the parameters are placed in. Stochastic gradient descent is able to jump to new and potentially better local minima, but this complicates convergence to the exact minimum, as SGD will keep overshooting.

Mini-batch (stochastic) gradient descent Performs an update for every mini-batch of n training examples: θ_new = θ_old − η ∇_θ J(θ; x^(i:i+n), y^(i:i+n))

Properties of mini-batch gradient descent Compared with SGD: It reduces the variance of the parameter updates, which can lead to more stable convergence; It can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually employed even when mini-batches are used.
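
A minimal sketch of the mini-batch variant on the same toy problem; the batch size n = 16 and learning rate are arbitrary illustrative choices.

```python
# Mini-batch gradient descent: one update per mini-batch of n examples.
# Toy least-squares problem; batch size and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)
eta, n = 0.05, 16

for epoch in range(50):
    idx = rng.permutation(len(X))                  # shuffle, then slice into mini-batches
    for start in range(0, len(X), n):
        batch = idx[start:start + n]
        Xb, Yb = X[batch], Y[batch]
        grad = Xb.T @ (Xb @ theta - Yb) / len(Xb)  # gradient over one mini-batch
        theta = theta - eta * grad
```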

CHALLENGES

Challenges Choosing a proper learning rate can be difficult: a learning rate that is too small leads to painfully slow convergence, while one that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge. Learning rate schedules and thresholds have to be defined in advance and are unable to adapt to a dataset's characteristics. The same learning rate applies to all parameter updates: if our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform larger updates for rarely occurring features.
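
To make the "defined in advance" limitation concrete, here is a hypothetical step-decay schedule; the decay factor and interval are arbitrary choices that cannot react to the data.

```python
# A hypothetical pre-defined learning rate schedule (step decay).
# drop and every are fixed in advance and do not adapt to the dataset.
def step_decay(eta0, epoch, drop=0.5, every=10):
    """Halve the initial learning rate eta0 every `every` epochs."""
    return eta0 * (drop ** (epoch // every))

# step_decay(0.1, 0) -> 0.1, step_decay(0.1, 10) -> 0.05, step_decay(0.1, 25) -> 0.025
```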

Challenges Another challenge is avoiding getting trapped in the numerous suboptimal local minima and saddle points of highly non-convex error functions. Dauphin et al. argue that the difficulty arises in fact not from local minima but from saddle points. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

MOMENTUM

Momentum One of the main limitations of gradient descent is local minima: when the gradient descent algorithm reaches a local minimum, the gradient becomes zero and the weights converge to a sub-optimal solution. A very popular method to avoid local minima is to compute a temporal average of the directions in which the weights have been moving recently. An easy way to implement this is with an exponential average:
v_t = γ v_{t−1} + η ∇_θ J(θ)
θ_new = θ_old − v_t
The term γ is called the momentum and has a value between 0 and 1 (typically 0.9). Properties: Fast convergence; Less oscillation
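
A minimal sketch of this momentum update, assuming a toy quadratic objective J(θ) = ||θ||² whose gradient is 2θ (an illustrative stand-in for ∇_θ J).

```python
# Momentum update: v_t = gamma * v_{t-1} + eta * grad(theta); theta -= v_t.
# The quadratic objective and its gradient are illustrative assumptions.
import numpy as np

def grad(theta):
    return 2 * theta                    # gradient of J(theta) = ||theta||^2

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
eta, gamma = 0.1, 0.9                   # learning rate and momentum term

for t in range(100):
    v = gamma * v + eta * grad(theta)   # accumulate an exponential average of gradients
    theta = theta - v
```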

Momentum Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance). The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.

Momentum The momentum term is also useful in spaces with long ravines characterized by sharp curvature across the ravine and a gently sloping floor. Sharp curvature tends to cause divergent oscillations across the ravine. To avoid this problem, we could decrease the learning rate, but this is too slow. The momentum term filters out the high curvature and allows the effective weight steps to be bigger. It turns out that ravines are not uncommon in optimization problems, so the use of momentum is helpful in many situations. However, a momentum term can hurt when the search is close to the minimum (think of the error surface as a bowl): as the network approaches the bottom of the error surface, it builds enough momentum to propel the weights in the opposite direction, creating an undesirable oscillation that results in slower convergence.

Smarter Ball?

NAG (Nesterov accelerated gradient) Nesterov accelerated gradient builds on the momentum algorithm by evaluating the gradient at an approximation of the next position of the parameters:
v_t = γ v_{t−1} + η ∇_θ J(θ − γ v_{t−1})
θ_new = θ_old − v_t
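
The look-ahead is the only change from plain momentum: the gradient is evaluated at θ − γ v_{t−1}. A minimal sketch under the same toy-quadratic assumption as before:

```python
# Nesterov accelerated gradient: evaluate the gradient at the look-ahead point.
# The quadratic objective and its gradient are illustrative assumptions.
import numpy as np

def grad(theta):
    return 2 * theta                                  # gradient of J(theta) = ||theta||^2

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
eta, gamma = 0.1, 0.9

for t in range(100):
    v = gamma * v + eta * grad(theta - gamma * v)     # gradient at the approximate next position
    theta = theta - v
```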

ADAPTIVE GRADIENTS

Adaptive Gradient Methods Let us adapt the learning rate of each parameter, performing larger updates for infrequent and smaller updates for frequent parameters. Methods AdaGrad (Adaptive Gradient Method) AdaDelta RMSProp (Root Mean Square Propagation) Adam (Adaptive Moment Estimation)

Adaptive Gradient Methods These methods use a different learning rate for each parameter θ_i ∈ R at every time step t. For brevity, we write g_i^(t) for the gradient of the objective function w.r.t. θ_i at time step t, so the plain per-parameter update is θ_i^(t+1) = θ_i^(t) − η g_i^(t). These methods modify the learning rate η at each time step t for every parameter θ_i based on the past gradients that have been computed for θ_i.

Adagrad Adagrad modifies the general learning rate η at each time step t for every parameter θ_i based on the past gradients that have been computed for θ_i:
θ_i^(t+1) = θ_i^(t) − (η / √(G_{t,i} + ε)) g_i^(t)
G_{t,i} = Σ_{k≤t} (g_i^(k))²
g_i^(k) = ∂J(θ)/∂θ_i evaluated at θ = θ^(k)
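
A minimal sketch of the Adagrad update with per-parameter accumulation of squared gradients; the toy gradient and ε = 10^−8 are illustrative assumptions, and η = 0.01 matches the default mentioned on the next slide.

```python
# Adagrad: scale each parameter's step by the accumulated squared gradients.
# The quadratic objective is an illustrative assumption; eta = 0.01, eps = 1e-8.
import numpy as np

def grad(theta):
    return 2 * theta                              # gradient of J(theta) = ||theta||^2

theta = np.array([5.0, -3.0])
G = np.zeros_like(theta)                          # per-parameter sum of squared gradients
eta, eps = 0.01, 1e-8

for t in range(1000):
    g = grad(theta)
    G += g ** 2                                   # G_{t,i} = sum_{k<=t} (g_i^(k))^2
    theta = theta - eta / np.sqrt(G + eps) * g    # per-parameter step eta / sqrt(G + eps)
```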

Adagrad Pros: It eliminates the need to manually tune the learning rate; most implementations use a default value of 0.01. Cons: It accumulates the squared gradients in the denominator, and the accumulated sum keeps growing during training. This causes the learning rate to shrink and eventually become infinitesimally small. The following algorithms aim to resolve this flaw.

RMSprop RMSprop has been developed to resolve Adagrad's diminishing learning rates:
λ_{t,i} = γ λ_{t−1,i} + (1 − γ) (g_i^(t))²
θ_i^(t+1) = θ_i^(t) − (η / √(λ_{t,i} + ε)) g_i^(t)
η: learning rate, suggested to be set to 0.001. γ: fixed momentum (decay) term.
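
A minimal sketch of the RMSprop update, using the suggested η = 0.001; the toy gradient, γ = 0.9, and ε are illustrative assumptions.

```python
# RMSprop: replace Adagrad's growing sum with an exponentially decaying average.
# The quadratic objective is an illustrative assumption; eta = 0.001, gamma = 0.9.
import numpy as np

def grad(theta):
    return 2 * theta                                 # gradient of J(theta) = ||theta||^2

theta = np.array([5.0, -3.0])
lam = np.zeros_like(theta)                           # decaying average of squared gradients
eta, gamma, eps = 0.001, 0.9, 1e-8

for t in range(5000):
    g = grad(theta)
    lam = gamma * lam + (1 - gamma) * g ** 2         # lambda_t = gamma*lambda_{t-1} + (1-gamma)*g^2
    theta = theta - eta / np.sqrt(lam + eps) * g
```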

Adam (Adaptive Moment Estimation) Adam keeps an exponentially decaying average of past gradients m_t, similar to momentum, and an exponentially decaying average of past squared gradients v_t:
m_{t,i} = β_1 m_{t−1,i} + (1 − β_1) g_i^(t)
v_{t,i} = β_2 v_{t−1,i} + (1 − β_2) (g_i^(t))²
Since m_t and v_t are initialized as zeros, they are biased towards zero; the authors counteract these biases by computing bias-corrected first and second moment estimates:
m̂_{t,i} = m_{t,i} / (1 − β_1^t)
v̂_{t,i} = v_{t,i} / (1 − β_2^t)
m_t: estimate of the first moment (the mean). v_t: estimate of the second moment (the uncentered variance). β_1 is suggested to be set to 0.9 and β_2 to 0.999.
This yields the Adam update rule:
θ_i^(t+1) = θ_i^(t) − (η / (√(v̂_{t,i}) + ε)) m̂_{t,i}
ε: suggested to be set to 10^−8.
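
A minimal sketch of the Adam update with bias correction, using the suggested β_1 = 0.9, β_2 = 0.999, and ε = 10^−8; the toy gradient and η = 0.001 are illustrative assumptions.

```python
# Adam: decaying averages of gradients (m) and squared gradients (v), bias-corrected.
# The quadratic objective is an illustrative assumption; beta1, beta2, eps follow the
# suggested values above, while eta = 0.001 is an assumption.
import numpy as np

def grad(theta):
    return 2 * theta                                   # gradient of J(theta) = ||theta||^2

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)                               # first-moment estimate
v = np.zeros_like(theta)                               # second-moment estimate
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 5001):                               # t starts at 1 for bias correction
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                       # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                       # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
```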

COMPARISON

Visualization of algorithms

Which optimizer to choose? RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. Adam adds bias correction and momentum to RMSprop, which helps it slightly outperform RMSprop towards the end of optimization as gradients become sparser.

CONCLUSION

Conclusion We looked at the three variants of gradient descent, among which mini-batch gradient descent is the most popular, and then at algorithms commonly used to improve SGD: Momentum, Nesterov accelerated gradient, Adagrad, RMSprop, and Adam.