OPTIMIZATION METHODS IN DEEP LEARNING

Based on Deep Learning, chapter 8, by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
Presented by Nadav Bhonker.

Tutorial Outline
- Optimization vs. learning
- Surrogate loss functions and early stopping
- Batch and minibatch algorithms
- Challenges in neural network optimization
- Basic algorithms: SGD, momentum, AdaGrad
- Regularization: L2, L1, data augmentation

OPTIMIZATION VS LEARNING

Optimization vs. Machine Learning
Optimization: the selection of a best element (with regard to some criterion) from some set of available alternatives; or, maximizing or minimizing a real function by systematically choosing input values from within an allowed set and computing the value of the function.
Machine learning: optimization of a function of our data? No!

Goal of Machine Learning
Minimize the risk
$J(\theta) = \mathbb{E}_{(x,y)\sim p_{data}}\, L(f(x;\theta), y)$
where $x$ is the data, $y$ the label, $p_{data}$ the true distribution of $(x, y)$, and $L$ a loss function. The goal is to reduce the chance of error on a prediction task regarding a distribution $P$. But $p_{data}$ is unknown!

Empirical Risk Minimization
$J(\theta) = \mathbb{E}_{(x,y)\sim \hat{p}_{data}}\, L(f(x;\theta), y) = \frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta), y^{(i)}\big)$
Minimize the empirical risk function; now we're back to optimization. Two caveats:
- Models with high capacity tend to overfit.
- Some loss functions don't have useful gradients (e.g. the 0-1 loss).
One common solution: a surrogate loss function.

Surrogate Loss Function
Replace the loss function we care about with a nicer function, e.g. replace the 0-1 loss with the log-likelihood. Optimization is performed until we meet some convergence criterion, e.g. a minimum of a validation loss.

Batch Algorithms
The objective function of ML is usually a sum over all the training examples, i.e.
$\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}\big(x^{(i)}, y^{(i)}; \theta\big)$
Computing the loss over the entire dataset:
1. Is computationally expensive!
2. Gives only a small variance reduction: by the CLT, the standard error of the estimate is $\sigma/\sqrt{n}$.
3. May use redundant data.
The solution: minibatch algorithms.
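A minimal NumPy sketch (not from the tutorial; the toy regression problem and all names are my own) illustrating point 2 above: the minibatch gradient is a noisy estimate of the full-batch gradient, and its error shrinks only like $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100_000, 10
X = rng.normal(size=(m, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=m)
w = np.zeros(d)                             # current parameters (arbitrary point)

def grad(batch_idx):
    """Gradient of the mean squared error over one batch of examples."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(batch_idx)

full = grad(np.arange(m))                   # "exact" full-batch gradient
for n in (10, 100, 1_000, 10_000):
    errs = [np.linalg.norm(grad(rng.choice(m, n, replace=False)) - full)
            for _ in range(50)]
    print(f"batch size {n:>6}: mean gradient error {np.mean(errs):.3f}")
# The error shrinks roughly like 1/sqrt(n): 10x more samples, only ~3x less noise.
```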

Minibatch Algorithms
Estimate the loss (and its gradient) from a subset of samples at each step: the same objective
$\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}\big(x^{(i)}, y^{(i)}; \theta\big)$
but with the sum taken over a minibatch of $m$ randomly sampled examples rather than the whole training set.

Minibatch - Random Sampling
Random batches! Many datasets are most naturally arranged so that successive examples are highly correlated. Sampling such a dataset in its stored order would introduce a heavy bias that impairs generalization (catastrophic forgetting), so the data must be shuffled (see the sampler sketch below).
- A larger batch gives a more accurate estimate of the gradient (though with less than linear returns).
- Small batches act as regularizers due to the noise they add.

CHALLENGES IN NEURAL NETWORK OPTIMIZATION

Reminder - Hessian matrix:
$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i\,\partial x_j} f(x)$
Taylor series expansion around $x_0$, where $g$ is the gradient:
$f(x) \approx f(x_0) + (x - x_0)^\top g + \frac{1}{2}(x - x_0)^\top H\,(x - x_0)$
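A minimal sketch (the function and parameter names are my own, not the tutorial's) of drawing shuffled minibatch indices each epoch, so that successive batches are not correlated:

```python
import numpy as np

def minibatches(num_examples, batch_size, rng):
    """Yield index arrays that cover the dataset in a freshly shuffled order."""
    perm = rng.permutation(num_examples)          # new random order every call (epoch)
    for start in range(0, num_examples, batch_size):
        yield perm[start:start + batch_size]

rng = np.random.default_rng(0)
for epoch in range(2):                            # reshuffle at every epoch
    for idx in minibatches(num_examples=10, batch_size=4, rng=rng):
        print(epoch, idx)                         # three index batches per epoch, in a new random order
```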

Ill-Conditioning
The Taylor expansion for an update with learning rate $\varepsilon$:
$f(x_0 - \varepsilon g) \approx f(x_0) - \varepsilon\, g^\top g + \frac{1}{2}\varepsilon^2\, g^\top H g$
We recognize here three terms:
1. $f(x_0)$ - the original value of the function.
2. $\varepsilon\, g^\top g$ - the slope of the function.
3. $\frac{1}{2}\varepsilon^2\, g^\top H g$ - the correction we must apply to account for the curvature of the function.
When (3) > (2) we have ill-conditioning: a step in the (negative) gradient direction may increase the value of the function! (A numeric check is sketched below.)

Ill-Conditioning cont.
In many cases the gradient norm does not shrink significantly throughout learning, but the $\frac{1}{2}\varepsilon^2\, g^\top H g$ term grows by more than an order of magnitude. Learning becomes very slow despite the strong gradient, because the learning rate must be shrunk to compensate for the ever stronger curvature.

Local Minima
Neural networks have many local minima, for two reasons:
1. Model non-identifiability: there exist $\theta_1 \neq \theta_2$ such that $f_{\theta_1}(x) = f_{\theta_2}(x)$ for all $x$; different combinations of weights may result in the same classification function.
2. Local minima that are worse than the global minimum. It is not clear whether this is a real problem or not.
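A minimal numeric sketch (the toy gradient and Hessian are my own choices) of the ill-conditioning check above: compare the slope term (2) with the curvature term (3) to predict whether a gradient step of size $\varepsilon$ would actually increase $f$.

```python
import numpy as np

def step_is_ill_conditioned(g, H, eps):
    """True if the second-order term outweighs the first-order decrease."""
    slope = eps * g @ g                    # term (2): first-order decrease
    curvature = 0.5 * eps**2 * g @ H @ g   # term (3): curvature correction
    return curvature > slope

g = np.array([1.0, 1.0])                   # gradient at the current point
H = np.diag([1.0, 1000.0])                 # badly conditioned Hessian
for eps in (1e-3, 1e-2, 1e-1):
    print(eps, step_is_ill_conditioned(g, H, eps))
# The two larger learning rates are flagged: the curvature term dominates,
# so the "descent" step would actually move uphill and eps must be shrunk.
```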

Plateaus, Saddle Points and Other Flat Regions
In high-dimensional spaces, random functions tend to have few local minima but many saddle points [1].

Plateaus, Saddle Points and Other Flat Regions cont.
- Due to this property, the (unmodified) Newton method is unsuitable for deep networks.
- Result shown for a shallow linear autoencoder [2].
- Empirically, gradient descent methods manage to escape saddle points [3].
Reminder - Hessian matrix eigenvalues (at a point of zero gradient):
1. All positive: local minimum.
2. All negative: local maximum.
3. Mixed: saddle point.

[1] Dauphin, Yann N., et al. "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization." Advances in Neural Information Processing Systems. 2014.
[2] Baldi, Pierre, and Kurt Hornik. "Neural networks and principal component analysis: Learning from examples without local minima." Neural Networks 2.1 (1989): 53-58.
[3] Goodfellow, Ian J., Oriol Vinyals, and Andrew M. Saxe. "Qualitatively characterizing neural network optimization problems." arXiv preprint arXiv:1412.6544 (2014).

BASIC ALGORITHMS

Stochastic Gradient Descent
Note that the learning rate $\varepsilon_k$ can change at each iteration. How to choose it? This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism.
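A minimal sketch of a plain minibatch SGD loop with an iteration-dependent learning rate $\varepsilon_k$; the toy least-squares objective, the placeholder schedule, and all names are my own, not the tutorial's.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)
theta = np.zeros(5)

def eps(k, eps0=0.1, eps_tau=0.001, tau=500):
    """Placeholder step-size schedule: the linear decay discussed just below."""
    a = min(k / tau, 1.0)
    return (1 - a) * eps0 + a * eps_tau

for k in range(1000):
    idx = rng.choice(len(X), size=32, replace=False)              # sample a random minibatch
    g = 2.0 * X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)     # minibatch gradient
    theta -= eps(k) * g                                           # SGD step with eps_k
print(np.round(theta, 2))   # should land near the true weights [1, -2, 0.5, 3, 0]
```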

Learning Rate Considerations
Considering a linear decrease: $\varepsilon_k = (1-\alpha)\,\varepsilon_0 + \alpha\,\varepsilon_\tau$, with $\alpha = k/\tau$.
Rules of thumb:
- Set $\tau$ so that a few hundred passes are made through the training data.
- $\varepsilon_\tau \approx 0.01\,\varepsilon_0$.
- $\varepsilon_0$: trial & error.
- Decrease the learning rate when some loss criterion is met.

Momentum
Modify the parameter update rule by introducing a variable $v$ (velocity): the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient:
$v \leftarrow \alpha v - \varepsilon\,\nabla_\theta J(\theta), \qquad \theta \leftarrow \theta + v$
(see the sketch below).

Momentum: Physical Analogy
Continuous gradient descent: $\frac{d\theta}{dt} = -\varepsilon\,\nabla E(\theta)$.
Consider the Newtonian equation for a point mass $m$ moving in a viscous medium with friction coefficient $\mu$ under the influence of a conservative force field with potential energy $E(\theta)$:
$m\,\frac{d^2\theta}{dt^2}$ (momentum) $+\ \mu\,\frac{d\theta}{dt}$ (friction) $= -\nabla E(\theta)$ (force, i.e. the gradient).

Momentum: Physical Analogy cont.
Discretizing the equation we get
$m\,\dfrac{\theta_{t+\Delta t} + \theta_{t-\Delta t} - 2\theta_t}{\Delta t^2} + \mu\,\dfrac{\theta_{t+\Delta t} - \theta_t}{\Delta t} = -\nabla E(\theta_t)$
$\theta_{t+\Delta t} - \theta_t = -\dfrac{\Delta t^2}{m + \mu\Delta t}\,\nabla E(\theta_t) + \dfrac{m}{m + \mu\Delta t}\,(\theta_t - \theta_{t-\Delta t})$
which is equivalent to the momentum update rule above, with $\varepsilon = \dfrac{\Delta t^2}{m + \mu\Delta t}$ and $\alpha = \dfrac{m}{m + \mu\Delta t}$.
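A minimal sketch of the momentum update rule just described, on a toy quadratic bowl of my own choosing ($\alpha$ is the momentum coefficient, $\varepsilon$ the learning rate):

```python
import numpy as np

A = np.diag([1.0, 10.0])                  # toy quadratic bowl E(theta) = 0.5 * theta^T A theta
grad_E = lambda theta: A @ theta          # its gradient

theta = np.array([5.0, 5.0])
v = np.zeros_like(theta)                  # velocity: decaying average of negative gradients
alpha, eps = 0.9, 0.05                    # momentum coefficient and learning rate
for _ in range(200):
    v = alpha * v - eps * grad_E(theta)   # accumulate velocity
    theta = theta + v                     # move along the velocity
print(np.round(theta, 3))                 # converges to the minimum at [0, 0]
```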

Momentum Visualization
Interactive visualization: distill.pub/2017/momentum/ (2017).

Algorithms with Adaptive Learning Rates
The learning rate is the hyper-parameter with the largest impact. The cost may be more sensitive in some directions than in others, so a learning rate per parameter may be even better.
AdaGrad: a per-parameter learning rate, scaled inversely proportional to the square root of the historical squared gradient values.
RMSProp: like AdaGrad, but accumulates an exponentially decaying average of the squared gradients instead of their full history.
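A minimal sketch (the toy constant gradients and all names are my own; $\delta$ avoids division by zero, $\rho$ is RMSProp's decay rate) of the AdaGrad and RMSProp per-parameter scalings: a coordinate with consistently large gradients receives a proportionally smaller effective step.

```python
import numpy as np

eps, delta, rho = 0.01, 1e-8, 0.9
r_ada = np.zeros(2)                       # AdaGrad: running sum of squared gradients
r_rms = np.zeros(2)                       # RMSProp: exponentially decaying average

for step in range(100):
    g = np.array([0.1, 10.0])             # pretend gradients: one tiny, one huge coordinate
    r_ada += g * g
    r_rms = rho * r_rms + (1 - rho) * g * g
    update_ada = -eps / (delta + np.sqrt(r_ada)) * g
    update_rms = -eps / np.sqrt(delta + r_rms) * g

print("AdaGrad update:", update_ada)      # both coordinates get a similar step that shrinks over time
print("RMSProp update:", update_rms)      # both coordinates get a step of roughly -eps, scale-free
```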

Which Optimizer to Use?

REGULARIZATION

Definition and Goal
Regularization: any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Goals:
- Encode prior knowledge (e.g. the spatial structure exploited by conv nets).
- Promote a simpler model.
- Trade bias for variance.

Parameter Norm Penalties
Limit a norm of the parameters:
$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\,\Omega(\theta)$
Two common approaches:
1. $L^2$ regularization: $\Omega(\theta) = \frac{1}{2}\lVert w\rVert_2^2$, a.k.a. weight decay, ridge regression, Tikhonov regularization ($w$ are the parameters to be regularized).
2. $L^1$ regularization: $\Omega(\theta) = \lVert w\rVert_1 = \sum_i \lvert w_i\rvert$.
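A minimal sketch (function names are my own) of the gradient contribution each penalty adds to the data-loss gradient during training:

```python
import numpy as np

alpha = 0.01

def l2_penalty_grad(w):
    # d/dw of (alpha/2) * ||w||_2^2  ->  alpha * w   (weight decay)
    return alpha * w

def l1_penalty_grad(w):
    # d/dw of alpha * ||w||_1  ->  alpha * sign(w)   (pushes weights toward exactly 0)
    return alpha * np.sign(w)

w = np.array([0.5, -2.0, 0.0])
print(l2_penalty_grad(w))   # [ 0.005 -0.02   0.   ]
print(l1_penalty_grad(w))   # [ 0.01  -0.01   0.   ]
```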

$L^2$ Regularization
Why "weight decay"?
$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \frac{\alpha}{2} w^\top w$
$\nabla_w \tilde{J}(\theta; X, y) = \alpha w + \nabla_w J(\theta; X, y)$
When updating $w$:
$w \leftarrow w - \varepsilon\big(\alpha w + \nabla_w J(\theta; X, y)\big)$
or, equivalently,
$w \leftarrow (1 - \varepsilon\alpha)\,w - \varepsilon\,\nabla_w J(\theta; X, y)$
i.e. the weights decay by a factor $(1 - \varepsilon\alpha)$ at each step.

$L^2$ Regularization - Intuition
Consider the linear regression task:
$\tilde{J}(\theta; X, y) = (Xw - y)^\top (Xw - y) + \frac{\alpha}{2} w^\top w$
This changes the exact solution from
$w = (X^\top X)^{-1} X^\top y$
to
$w = (X^\top X + \alpha I)^{-1} X^\top y$.
The additional $\alpha$ on the diagonal corresponds to increasing the variance of each feature of $X$.

$L^1$ Regularization
The modified objective and its gradient:
$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\lVert w\rVert_1$
$\nabla_w \tilde{J}(\theta; X, y) = \alpha\,\mathrm{sign}(w) + \nabla_w J(\theta; X, y)$
$L^1$ regularization tends to make solutions sparse.

Dataset Augmentation
The best way to generalize: lots of data! But data is limited, so we can fake it.
Common augmentations for object recognition: translation, scale, rotation, random crops. Noise injection is also common and effective (see the sketch below).
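A minimal sketch of two of the augmentations listed above, random crops (a small translation) and noise injection, applied to a fake image; the shapes and parameter values are illustrative choices of my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop=28, noise_std=0.05):
    """Random crop (a small random translation) followed by Gaussian noise injection."""
    h, w = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    return patch + noise_std * rng.normal(size=patch.shape)

image = rng.random((32, 32))                            # stand-in for a real training image
batch = np.stack([augment(image) for _ in range(8)])    # 8 "new" examples from one image
print(batch.shape)                                      # (8, 28, 28)
```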