Day 3 Lecture 3. Optimizing deep networks


Convex optimization. A function f is convex if for all α ∈ [0, 1] and all x, y: f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y), i.e. the function lies on or below the chord between any two points and on or above its tangent lines. Examples: quadratics, 2-norms. Key property: any local minimum is a global minimum.

Gradient descent. E.g. linear regression: we have a loss function L(w) that we need to optimize. Starting from some w_t, follow the negative gradient (the direction of steepest descent): w_{t+1} = w_t − α ∇_w L(w_t).
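
A minimal sketch of this update on a toy least-squares problem (the data, iteration count, and step size below are illustrative assumptions, not values from the lecture):

```python
import numpy as np

# Toy least-squares problem: L(w) = mean((Xw - y)^2)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))            # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])           # targets from a known weight vector

w = np.zeros(3)                              # starting point
alpha = 0.1                                  # learning rate (step size)
for t in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)    # full-dataset gradient of the loss
    w = w - alpha * grad                     # gradient descent step
```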

Stochastic gradient descent. Computing the gradient over the full dataset at each step is slow, especially when the dataset is large. For many loss functions we care about, the full gradient is the average of the gradients on individual examples. Idea: pick a single random training example, evaluate the (noisy) loss on that example, compute the gradient with respect to that loss (the stochastic gradient), and take a gradient descent step using this estimate.
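
A minimal sketch of the stochastic version on the same kind of toy problem (all names and values below are assumptions for illustration): at each step we sample one example and follow its noisy gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
alpha = 0.01
for t in range(5000):
    i = rng.integers(len(y))                 # pick a single random training example
    grad = 2 * X[i] * (X[i] @ w - y[i])      # noisy gradient estimate from that one example
    w = w - alpha * grad                     # SGD step
```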

Non-convex optimization. The objective function of a deep network is non-convex: there may be many local minima, plateaus (flat regions), and saddle points. Q: Why does SGD seem to work so well for optimizing these complex non-convex functions?

Local minima. Q: Why doesn't SGD get stuck at local minima? A: It does. But theory and experiments suggest that for high-dimensional deep models, the value of the loss function at most local minima is close to its value at the global minimum. [Figure: values of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points.] As the number of parameters increases, the local minima tend to cluster more tightly: most local minima are good local minima! Choromanska et al. The loss surfaces of multilayer networks, AISTATS 2015. http://arxiv.org/abs/1412.0233

Saddle points. Q: Are there many saddle points in high-dimensional loss functions? A: Local minima dominate in low dimensions, but saddle points dominate in high dimensions. Why? Intuition from random matrix theory: at a critical point (zero gradient), look at the eigenvalues of the Hessian matrix. In N dimensions we need all N eigenvalues to be positive for a local minimum, and each eigenvalue is positive with probability roughly 0.5 (P(eigenvalue > 0) ~ 0.5). As N grows it becomes exponentially unlikely that all eigenvalues are positive (or all negative), so most critical points are saddle points. Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014. http://arxiv.org/abs/1406.2572
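
A quick toy experiment illustrating this intuition (an assumed simplification, not the analysis of Dauphin et al.): draw random symmetric matrices as stand-ins for Hessians at critical points and count how often all eigenvalues come out positive.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_local_minima(n, trials=2000):
    """Fraction of random symmetric n x n 'Hessians' whose eigenvalues are all positive."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2                       # symmetric matrix standing in for a Hessian
        if np.all(np.linalg.eigvalsh(H) > 0):   # local-minimum condition: all eigenvalues > 0
            count += 1
    return count / trials

for n in (1, 2, 5, 10):
    print(n, frac_local_minima(n))              # falls rapidly towards zero as n grows
```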

Saddle points. Q: Does SGD get stuck at saddle points? A: No, not really. Gradient descent is initially attracted to saddle points, but unless it hits the critical point exactly it is repelled once it gets close, and hitting the critical point exactly is unlikely because the estimated gradient of the loss is stochastic. In practice, SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it. Warning: Newton's method works poorly for neural nets because it is attracted to saddle points.

Plateaus. Regions of the weight space where the loss function is mostly flat (small gradients). They can sometimes be avoided using:
- Careful initialization
- Non-saturating transfer functions
- Dynamic gradient scaling
- Network design
- Loss function design

Learning rates and initialization

Choosing the learning rate. For most first-order optimization methods we need to choose a learning rate (aka step size).
- Too large: overshoots the local minimum, loss increases
- Too small: makes very slow progress, can get stuck
- Good learning rate: makes steady progress toward the local minimum
Usually we want a higher learning rate at the start and a lower one later on. Common strategy in practice: start off with a high learning rate (e.g. 0.1-0.001), run for several epochs (1-10), then decrease the learning rate by multiplying it by a constant factor (0.1-0.5), and repeat.
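
A minimal sketch of this step-decay strategy (the hyperparameter values are illustrative, not prescriptions from the lecture):

```python
def step_decay_lr(initial_lr=0.1, decay_factor=0.5, epochs_per_drop=10):
    """Yield one learning rate per epoch: start high, shrink by a constant factor periodically."""
    lr = initial_lr
    epoch = 0
    while True:
        yield lr
        epoch += 1
        if epoch % epochs_per_drop == 0:
            lr *= decay_factor

schedule = step_decay_lr()
lrs = [next(schedule) for _ in range(30)]   # 0.1 for 10 epochs, then 0.05, then 0.025
```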

Weight initialization. We need to pick a starting point for gradient descent: an initial set of weights.
- Zero is a very bad idea! Zero is a critical point: the error signal will not propagate and the gradients will be zero, so no progress is made.
- A constant value is also a bad idea: we need to break the symmetry between units.
- Instead, use small random values, e.g. zero-mean Gaussian noise with a small constant variance.
Ideally we want the inputs to the activation functions (e.g. sigmoid, tanh, ReLU) to lie mostly in their linear region, where gradients are large, so that larger gradients propagate and training converges faster.
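
A minimal sketch of such an initialization (the layer sizes and the 0.01 scale are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, std=0.01):
    """Small zero-mean Gaussian weights break symmetry; biases can start at zero."""
    W = std * rng.standard_normal((n_in, n_out))
    b = np.zeros(n_out)
    return W, b

W1, b1 = init_layer(784, 256)   # e.g. the first layer of an MNIST-sized network
```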

Batch normalization. As learning progresses, the distribution of each layer's inputs changes due to the parameter updates. This can push most inputs into the nonlinear, saturated regime of the activation function and slow down learning. Batch normalization is a technique to reduce this effect. It works by re-normalizing layer inputs to have zero mean and unit standard deviation with respect to running batch estimates, and it adds a learnable scale and bias term so the network can still use the nonlinearity. It usually allows much higher learning rates! Typical placement: conv/fc layer (no bias needed) → Batch Normalization → ReLU. Ioffe and Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015. https://arxiv.org/abs/1502.03167
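
A simplified sketch of the batch-norm transform at training time (gamma and beta are the learnable scale and bias; the running statistics used at test time are omitted here):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then apply the learnable scale and bias."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit standard deviation
    return gamma * x_hat + beta               # scale (gamma) and shift (beta)
```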

First-order optimization algorithms (SGD bells and whistles)

Vanilla mini-batch SGD. Same update rule as gradient descent, w_{t+1} = w_t − α ∇_w L(w_t), but with the gradient evaluated on a mini-batch of examples rather than the full dataset.

Momentum. Accumulate an exponentially decaying velocity from past gradients and move along it: v_{t+1} = μ v_t − α ∇_w L(w_t), w_{t+1} = w_t + v_{t+1}. Requires a velocity buffer of the same size as the parameters: 2x memory for parameters!
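
A minimal sketch of the classical momentum update (mu = 0.9 and the learning rate are typical but assumed values); the velocity buffer has the same shape as the parameters, which is where the 2x memory comes from:

```python
import numpy as np

def momentum_update(w, v, grad, lr=0.01, mu=0.9):
    """Classical momentum: accumulate a decaying velocity and step along it."""
    v = mu * v - lr * grad    # decay the old velocity, add the new gradient step
    w = w + v
    return w, v

w = np.zeros(10)
v = np.zeros_like(w)          # velocity buffer, same shape as the parameters
```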

Nesterov accelerated gradient (NAG). Approximate what the parameters will be on the next time step by using the current velocity (w_t + μ v_t: what we expect the parameters to be based on momentum alone). Update the velocity using the gradient evaluated at that predicted point, instead of where we are now: v_{t+1} = μ v_t − α ∇_w L(w_t + μ v_t), w_{t+1} = w_t + v_{t+1}. Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k²).
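
A minimal sketch of the NAG update in the look-ahead form above (grad_fn is a placeholder for whatever computes the loss gradient; hyperparameter values are assumed):

```python
def nag_update(w, v, grad_fn, lr=0.01, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point w + mu*v."""
    lookahead_grad = grad_fn(w + mu * v)   # gradient where momentum alone would take us
    v = mu * v - lr * lookahead_grad
    w = w + v
    return w, v
```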

Adagrad. Adapts the learning rate for each of the parameters based on the sizes of its previous updates: updates are scaled to be larger for parameters that have been updated less, and smaller for parameters that have been updated more. Store the sum of squares of the gradients so far (g_i is the gradient of the loss at timestep i) in the diagonal of a matrix G_t. Update rule (element-wise): w_{t+1} = w_t − α g_t / (√(diag(G_t)) + ε). Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011.
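
A minimal sketch of the Adagrad update, keeping only the diagonal accumulator as is done in practice (the learning rate and epsilon are assumed values):

```python
import numpy as np

def adagrad_update(w, G, grad, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients; parameters updated more get smaller steps."""
    G = G + grad ** 2                        # running sum of squared gradients (diagonal of G_t)
    w = w - lr * grad / (np.sqrt(G) + eps)   # element-wise scaling of the step
    return w, G
```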

RMSProp. A modification of Adagrad that addresses its aggressively decaying learning rate. Instead of storing the sum of squares of the gradients over all time steps so far, use a decayed moving average of the squared gradients: E[g²]_t = ρ E[g²]_{t−1} + (1 − ρ) g_t². Update rule: w_{t+1} = w_t − α g_t / (√(E[g²]_t) + ε). Geoff Hinton, unpublished.
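
A minimal sketch of the RMSProp update (rho = 0.9 is the usual but assumed decay factor):

```python
import numpy as np

def rmsprop_update(w, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
    """RMSProp: like Adagrad, but with a decayed moving average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq
```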

Adam. Combines momentum and RMSProp: keep a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (like RMSProp). First-order: m_t = β₁ m_{t−1} + (1 − β₁) g_t. Second-order: v_t = β₂ v_{t−1} + (1 − β₂) g_t². Update rule, with bias-corrected m̂_t = m_t / (1 − β₁^t) and v̂_t = v_t / (1 − β₂^t): w_{t+1} = w_t − α m̂_t / (√(v̂_t) + ε). Two extra buffers per parameter: 3x memory! Kingma and Ba. Adam: A Method for Stochastic Optimization. ICLR 2015.
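
A minimal sketch of the Adam update with the bias-correction terms (the default-style hyperparameters here are assumed):

```python
import numpy as np

def adam_update(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decayed averages of the gradient (m) and squared gradient (v), bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)               # bias correction; t counts from 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```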

Images credit: Alec Radford.

Summary
- Non-convex optimization means local minima and saddle points.
- In high dimensions there are many more saddle points than local optima.
- Saddle points attract, but usually SGD can escape.
- Choosing a good learning rate is critical.
- Weight initialization is key to ensuring gradients propagate nicely (batch normalization also helps).
- Several SGD extensions (momentum, NAG, Adagrad, RMSProp, Adam) can help improve convergence.