Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba

Outline: Random search vs. gradient descent. Finding better search directions. Designing white-box optimization methods to improve computational efficiency.

Neural networks: Consider an input-output pair (x, y) from a set of N training examples. A neural network is a parameterized function that maps the input x to an output prediction ŷ = f(x; w) using d weights. Comparing the output prediction with the correct answer (the target y) gives a loss ℓ(f(x; w), y) for the current weights; the averaged loss over the N training examples, L(w) = (1/N) Σᵢ ℓ(f(xᵢ; w), yᵢ), is the training objective, written L(w) for short.

Why is learning difficult? Neural networks are non-convex: there are many local optima. (Figure: convex vs. non-convex sets and convex vs. non-convex functions.)

Why is learning difficult? Neural networks are non-convex, and they are not smooth. For a Lipschitz-continuous function the change in f is bounded, |f(w) − f(w′)| ≤ L‖w − w′‖, where L is the Lipschitz constant; for a function that is not Lipschitz-continuous, no such bound holds. (Figure: a Lipschitz-continuous function vs. one that is not.)

Why is learning difficult? Neural networks are non-convex, not smooth, and have millions of trainable weights: roughly 5 million for an ImageNet classifier, roughly 100 million for neural machine translation, and roughly 40 million(?) for AlphaGo.

How to train neural networks with random search: Perturb the weights of the neural network by a random vector and evaluate the perturbed averaged loss over the training examples. Keep the perturbation if the loss improves; discard it otherwise. Repeat.

How to train neural networks with random search: Perturb the weights of the neural network by a random vector and evaluate the perturbed averaged loss over the training examples. Add the perturbation, weighted by the perturbed loss, to the current weights. Repeat.

How to train neural networks with random search: Perturb the weights of the neural network by M random vectors and evaluate each perturbed averaged loss over the training examples. Add the perturbations, weighted by their perturbed losses, to the current weights. Repeat.

How to train neural networks with random search: Perturb the weights of the neural network by M random vectors and evaluate each perturbed averaged loss over the training examples. Add the perturbations, weighted by their perturbed losses, to the current weights. Repeat. This family of algorithms has many names: evolution strategies, zeroth-order methods, derivative-free methods. Sources: Salimans et al., 2017; Flaxman et al., 2005; Agarwal et al., 2010; Nesterov, 2011; Berahas et al., 2018.
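To make the procedure concrete, here is a minimal NumPy sketch of the perturb-and-reweight loop described above (the names loss_fn, w, sigma, and lr are illustrative, not from the slides; since we minimize a loss, the update steps against the aggregated direction):

```python
import numpy as np

def random_search_step(w, loss_fn, M=100, sigma=0.1, lr=0.01):
    """One evolution-strategy-style update: perturb the weights with M random
    vectors, weight each perturbation by its perturbed loss, and aggregate."""
    eps = np.random.randn(M, w.size)              # M random perturbation directions
    losses = np.array([loss_fn(w + sigma * e) for e in eps])
    losses = losses - losses.mean()               # center the losses to reduce variance
    grad_est = (eps.T @ losses) / (M * sigma)     # aggregated search direction
    return w - lr * grad_est                      # step against the estimated gradient

# Toy usage: drive a quadratic "loss" over 10 weights toward zero.
w = np.random.randn(10)
quad = lambda v: 0.5 * np.sum(v ** 2)
for _ in range(200):
    w = random_search_step(w, quad)
print(quad(w))   # much smaller than the starting loss
```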

How well does random search work? Training a 784-512-512-10 neural network on MNIST (about 0.5 million weights): 1600 random samples vs. 5000 random samples vs. backprop. Source: Wen et al., 2018.

Why does random search work? Random search approximates a finite-difference gradient with stochastic samples: each perturbation gives a directional derivative, and the weighted perturbations aggregate into a search direction. This seems really inefficient with respect to the number of weights (we come back to this later).
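The finite-difference interpretation can be written with a standard Gaussian-smoothing identity (a well-known result, stated here because the slide's own formula is not in the transcript): each sample L(w + σε)ε is a stochastic directional-derivative estimate, and averaging M of them gives the aggregated search direction.

\[
\nabla_w\, \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\!\left[L(w+\sigma\epsilon)\right]
= \frac{1}{\sigma}\,\mathbb{E}_{\epsilon}\!\left[L(w+\sigma\epsilon)\,\epsilon\right]
\;\approx\; \frac{1}{M\sigma}\sum_{m=1}^{M} L(w+\sigma\epsilon_m)\,\epsilon_m .
\]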

Gradient descent and back-propagation: Using random search, computing d directional gradients requires d forward propagations. Although these can be done in parallel, it would be more efficient if we could query gradient information directly. For a continuous loss function we can obtain the full gradient more efficiently by back-propagation.

Gradient descent: To justify our choice of search direction, we formulate the following optimization problem. Consider a small change to the weights that minimizes the averaged loss, where the smallness of the change is expressed as a constraint in Euclidean space.

Gradient descent: To justify our choice of search direction, we formulate the following optimization problem. Consider a small change to the weights that minimizes the averaged loss, where the smallness of the change is expressed as a constraint in Euclidean space. Linearizing the loss function and solving the Lagrangian gives the update rule.
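A compact version of the derivation the slide refers to (standard; the learning rate η absorbs the Lagrange multiplier):

\[
\min_{\Delta w}\; L(w_t) + \nabla L(w_t)^{\top}\Delta w
\quad\text{s.t.}\quad \|\Delta w\|_2^2 \le \epsilon
\;\;\Longrightarrow\;\;
\Delta w \propto -\nabla L(w_t),
\qquad
w_{t+1} = w_t - \eta\,\nabla L(w_t).
\]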

How well does gradient descent work? Let L be the Lipschitz constant of the gradient. Assuming convexity, gradient descent converges at a 1/T rate.
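For reference, the standard form of the smoothness assumption and the resulting rate (a textbook result for convex functions with step size 1/L; the exact constants shown on the slide are not in the transcript):

\[
\|\nabla L(w) - \nabla L(w')\|_2 \le L\,\|w - w'\|_2,
\qquad
L(w_T) - L(w^\star) \le \frac{L\,\|w_0 - w^\star\|_2^2}{2T}.
\]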

How well does gradient descent work? Gradient descent can be inefficient under ill-conditioned curvature. How to improve gradient descent?

Momentum: smooth the gradient with a moving average. Keeping a running average of the gradient updates can smooth out the zig-zag behavior. Nesterov's accelerated gradient uses a lookahead trick; assuming convexity, NAG obtains a 1/T² rate.
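A minimal NumPy sketch of the two updates (grad stands for any function returning ∇L(w); the hyperparameter names and values are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Heavy-ball momentum: keep a running average of past gradient updates."""
    v = beta * v - lr * grad(w)              # smoothed update direction
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, beta=0.9):
    """Nesterov's accelerated gradient: evaluate the gradient at a lookahead point."""
    v = beta * v - lr * grad(w + beta * v)   # the "lookahead" trick
    return w + v, v

# Toy usage on a badly scaled quadratic loss 0.5 * w^T A w.
A = np.diag([1.0, 50.0])
grad = lambda w: A @ w
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, v = nesterov_step(w, v, grad)
print(w)   # approaches the minimum at the origin
```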

Stochastic gradient descent: improving efficiency. Computing the averaged loss requires going through the entire training dataset of N data points. Maybe we do not have to go over millions of images for a small weight update: sample B points from the whole training set and compute a mini-batch loss instead.

Stochastic gradient descent: improving efficiency. Sample B points from the whole training set and compute a mini-batch loss. Without loss of generality, assume strong convexity and B = 1; with an appropriately decaying learning rate, SGD still comes with a convergence-rate guarantee.

Stochastic gradient descent: improving efficiency. Sample B points from the whole training set and compute a mini-batch loss. Without loss of generality, assume strong convexity and B = 1; with an appropriately decaying learning rate, SGD still comes with a convergence-rate guarantee. Are we making an improvement over random search at all? Yes: the convergence rate for random search with two-point methods degrades with the number of weights (Duchi et al., 2015).
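A minimal mini-batch SGD loop with a decaying step size, as a sketch of the scheme above (the synthetic least-squares data, the 1/t-style schedule, and all constants are illustrative assumptions, not taken from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 10_000, 20, 32                      # training set size, weights, batch size
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)     # noisy linear regression targets

w = np.zeros(d)
for t in range(1, 2001):
    idx = rng.integers(0, N, size=B)          # sample B points from the training set
    xb, yb = X[idx], y[idx]
    grad = xb.T @ (xb @ w - yb) / B           # mini-batch gradient of the squared error
    lr = 0.1 / (1.0 + 0.01 * t)               # decaying learning rate
    w -= lr * grad

print(np.linalg.norm(w - w_true))             # small: SGD nearly recovers w_true
```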

Revisiting gradient descent: The constrained optimization problems behind different gradient-based algorithms share the same linearized objective; gradient descent constrains the Euclidean norm of the weight change. Are there other constraints we can use here?

Revisiting gradient descent: The constrained optimization problems behind different gradient-based algorithms share the same linearized objective; gradient descent constrains the Euclidean norm of the weight change, while natural gradient (Amari, 1998) constrains the change in the model's predictive distribution.

Natural gradient descent: Again, consider a small change to the weights that minimizes the averaged loss, but now the small change is expressed as a constraint in probability space using the KL divergence. Linearize the model and take a second-order Taylor expansion of the KL divergence around the current weights.
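Filling in the standard form of this construction (the second-order expansion of the KL term yields the Fisher information matrix F; the exact notation on the slide is not in the transcript):

\[
\min_{\Delta w}\; \nabla L(w_t)^{\top}\Delta w
\quad\text{s.t.}\quad
\mathrm{KL}\!\left(p_{w_t} \,\Vert\, p_{w_t+\Delta w}\right)
\;\approx\; \tfrac{1}{2}\,\Delta w^{\top} F\, \Delta w \;\le\; \epsilon,
\qquad
F = \mathbb{E}\!\left[\nabla_w \log p_w(y\,|\,x)\,\nabla_w \log p_w(y\,|\,x)^{\top}\right],
\]
which gives the natural-gradient update \(w_{t+1} = w_t - \eta\, F^{-1}\nabla L(w_t)\).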

Fisher information matrix: the expected outer product of the gradients of the model's log-likelihood (the score), which appears as the second-order term of the KL expansion above.

Fisher information matrix, geodesic perspective: a distance measure in weight space vs. in model output space.

When first-order methods fail: better gradient estimation does not overcome correlated, ill-conditioned curvature in the weight space. (Figure: SGD path on an ill-conditioned loss surface over weights w1 and w2.)

Second-order optimization algorithms: natural gradient corrects for the ill-conditioned curvature using a preconditioning matrix (Amari, 1998). Question: how do we find the best preconditioning matrix?

Second-order methods: The constrained optimization problems behind different gradient-based algorithms share the same linearized objective. Gradient descent constrains the Euclidean norm of the update, natural gradient constrains the KL divergence, and a family of second-order approximations constrains the update under a general preconditioning matrix.

Finding a good preconditioning matrix: Intuitively, we would like a preconditioning matrix A that focuses on the parts of weight space that have not been explored much by previous updates, and we can set up an optimization problem to find such a preconditioner. In practice, most algorithms of this kind, such as AdaGrad, Adam, RMSProp, and TONGA, use a diagonal approximation to make the preconditioning tractable.
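As a concrete instance of the diagonal approximation, here is a minimal AdaGrad-style step in NumPy (names and hyperparameters are illustrative): the accumulated squared gradients act as a diagonal preconditioner that shrinks steps along directions that have already received large updates.

```python
import numpy as np

def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    """Diagonal preconditioning: scale each coordinate of the gradient by the
    inverse square root of its accumulated squared magnitude."""
    accum = accum + g ** 2                       # per-parameter history of squared gradients
    w = w - lr * g / (np.sqrt(accum) + eps)      # smaller steps in well-explored directions
    return w, accum

# Toy usage on a badly scaled quadratic loss 0.5 * w^T A w.
A = np.diag([1.0, 100.0])
grad = lambda w: A @ w
w, accum = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    w, accum = adagrad_step(w, grad(w), accum)
print(w)   # both coordinates shrink toward 0 despite the 100x difference in curvature
```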

When first-order methods work well: SGD takes noisy gradient steps; Adam adapts the learning rate for each parameter individually (Kingma and Ba, ICLR 2015). (Figure: SGD path over weights w1 and w2, with directions of small and large gradient variance.)

When first-order methods work well: Adam speeds up learning for parameters with low-variance gradients and slows down learning along high-variance directions (Kingma and Ba, ICLR 2015). (Figure: Adam path over weights w1 and w2.)
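A minimal sketch of the Adam-style update (Kingma and Ba, 2015): exponential moving averages of the gradient and its elementwise square, with bias correction for early iterations. The hyperparameter values follow the common defaults, and the toy problem is illustrative.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient (m) and its square (v),
    with bias correction for the early iterations."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step size
    return w, m, v

# Toy usage on a badly scaled quadratic loss.
grad = lambda w: np.array([1.0, 100.0]) * w
w, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    w, m, v = adam_step(w, grad(w), m, v, t, lr=0.01)
print(w)   # both coordinates end up close to the optimum at the origin
```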

When first-order methods fail: we need the full curvature matrix to correct for correlations among the weights.

Learning on a single machine: SGD updates the model using a noisy gradient of the loss function computed on a random mini-batch of B data points (batch size Bz = B) drawn from the N training examples. Commonly used SGD variants: AdaGrad (Duchi et al., COLT 2010), Adam (Kingma and Ba, ICLR 2015), and Nesterov's momentum (Sutskever et al., ICML 2013).

Distributed learning: where does the speedup come from? 1. More accurate gradient estimation. 2. More frequent updates. (Figure: SGD training curves for batch sizes 256, 1024, and 2048 against the number of training examples consumed; larger batches give better gradient estimation but show diminishing returns.)

Here is the training plot of a state-of-the-art ResNet trained on 8 GPUs, together with the Amazon EC2 instance cost.

Scalability of the black-box optimization algorithms: SGD scales poorly with additional computational resources but scales well with the number of weights; natural gradient scales well with additional computational resources but poorly with the number of weights. Natural gradient faces two problems: 1. it is memory inefficient; 2. the inverse is intractable.

Background: natural gradient for neural networks. Insight: the flattened gradient matrix of a layer has a Kronecker-product structure.

Background: natural gradient for neural networks. Kronecker product definition: tile the larger matrix with rescaled copies of the smaller matrix. Insight: the flattened gradient matrix of a layer has a Kronecker-product structure.
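A small NumPy check of this insight for a single fully connected layer, under the usual conventions (assumed here, since the slide's equations are not in the transcript): if the layer computes s = W a and δ = ∂L/∂s, then ∂L/∂W = δ aᵀ, and its column-major flattening equals the Kronecker product a ⊗ δ.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=4)            # layer input activations
delta = rng.normal(size=3)        # backpropagated derivative w.r.t. the pre-activations

grad_W = np.outer(delta, a)       # gradient of the loss w.r.t. the 3x4 weight matrix

vec_grad = grad_W.ravel(order="F")   # column-major flattening of the gradient matrix
kron_grad = np.kron(a, delta)        # Kronecker product of the two small vectors

print(np.allclose(vec_grad, kron_grad))   # True: vec(delta a^T) = a (kron) delta
```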

Background: natural gradient for neural networks. Because the flattened gradient matrix of a layer has a Kronecker-product structure, the corresponding block of the Fisher information matrix is an expectation of Kronecker products.

Background: Kronecker-factored natural gradient (K-FAC) approximates the Fisher information matrix (Martens and Grosse, ICML 2015).

Background: Kronecker-factored natural gradient: approximating the Fisher information matrix by its Kronecker factors is memory efficient (Martens and Grosse, ICML 2015).

Background: Kronecker-factored natural gradient: the Kronecker factors are memory efficient and make the inverse of the Fisher information matrix tractable (Martens and Grosse, ICML 2015). Problem: it is still computationally expensive compared to SGD.
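A small NumPy sketch of why the factorization makes the inverse tractable (a simplified version of the K-FAC update; the factor matrices here are random stand-ins for the activation and derivative second moments): if a layer's Fisher block is approximated as A ⊗ G, then its inverse is A⁻¹ ⊗ G⁻¹, and applying it to the flattened gradient reduces to two small matrix products, G⁻¹ (∂L/∂W) A⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 4, 3
A = np.cov(rng.normal(size=(d_in, 100)))    # ~ E[a a^T], Kronecker factor for the inputs
G = np.cov(rng.normal(size=(d_out, 100)))   # ~ E[delta delta^T], factor for the derivatives
grad_W = rng.normal(size=(d_out, d_in))     # gradient of the loss for this layer

# Naive route: build the full (d_in*d_out x d_in*d_out) Fisher block and solve.
F = np.kron(A, G)
naive = np.linalg.solve(F, grad_W.ravel(order="F")).reshape(d_out, d_in, order="F")

# K-FAC route: invert only the two small factors.
kfac = np.linalg.inv(G) @ grad_W @ np.linalg.inv(A)

print(np.allclose(naive, kfac))   # True: (A kron G)^-1 vec(grad) = vec(G^-1 grad A^-1)
```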

Distributed K-FAC natural gradient (Ba et al., ICLR 2017): K-FAC updates are computed by groups of workers operating on data in parallel, with gradient workers, statistics workers, and inverse workers handling the different parts of the update.

Scalability experiments. Scientific question: how well does distributed K-FAC scale with more parallel computational resources compared to distributed SGD? (Figure: distributed SGD gets better gradient estimation but shows diminishing returns as the batch size grows from 256 to 1024 to 2048, while distributed K-FAC achieves an almost linear speedup in the number of training examples consumed; Ba et al., ICLR 2017.)

Second-order methods (recap): The constrained optimization problems behind different gradient-based algorithms share the same linearized objective. Gradient descent constrains the Euclidean norm of the update, natural gradient constrains the KL divergence, and a family of second-order approximations constrains the update under a general preconditioning matrix.

Thank you