CSC321 Lecture 8: Optimization

Similar documents
CSC321 Lecture 7: Optimization

Day 3 Lecture 3. Optimizing deep networks

Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on

CSC2541 Lecture 5 Natural Gradient

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Lecture 6 Optimization for Deep Neural Networks

Lecture 9: Generalization

CSC321 Lecture 5 Learning in a Single Neuron

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Lecture 5: Logistic Regression. Neural Networks

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm

Lecture 2: Linear regression

CSC321 Lecture 7 Neural language models

Vote. Vote on timing for night section: Option 1 (what we have now) Option 2. Lecture, 6:10-7:50 25 minute dinner break Tutorial, 8:15-9

Introduction to gradient descent

Lecture 15: Exploding and Vanishing Gradients

CSC321 Lecture 4: Learning a Classifier

Deep Feedforward Networks

ECS171: Machine Learning

CSC 411 Lecture 10: Neural Networks

Introduction to Neural Networks

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

CS260: Machine Learning Algorithms

CSC321 Lecture 15: Exploding and Vanishing Gradients

Least Mean Squares Regression

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

1 What a Neural Network Computes

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

CSC321 Lecture 4: Learning a Classifier

ECE521 lecture 4: 19 January Optimization, MLE, regularization

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Deep Learning & Artificial Intelligence WS 2018/2019

CSC321 Lecture 2: Linear Regression

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Gradient Descent. Sargur Srihari

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Neural Networks: Optimization & Regularization

Multilayer Perceptrons (MLPs)

Rapid Introduction to Machine Learning/ Deep Learning

CSC321 Lecture 6: Backpropagation

Lecture 6: Backpropagation

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Logistic Regression. Stochastic Gradient Descent

Input layer. Weight matrix [ ] Output layer

Deep Learning & Neural Networks Lecture 4

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CSC321 Lecture 10 Training RNNs

Lecture 4: Training a Classifier

CSC321 Lecture 4 The Perceptron Algorithm

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

11 More Regression; Newton s Method; ROC Curves

Least Mean Squares Regression. Machine Learning Fall 2018

LECTURE # - NEURAL COMPUTATION, Feb 04, Linear Regression. x 1 θ 1 output... θ M x M. Assumes a functional form

Value Function Methods. CS : Deep Reinforcement Learning Sergey Levine

Learning in State-Space Reinforcement Learning CIS 32

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Optimization and Gradient Descent

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Stochastic Gradient Descent. CS 584: Big Data Analytics

Optimization for neural networks

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Approximate inference in Energy-Based Models

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Lecture 35: Optimization and Neural Nets

CSC321 Lecture 18: Learning Probabilistic Models

The K-FAC method for neural network optimization

SGD and Deep Learning

Lecture 26: Neural Nets

Stochastic Gradient Descent

Non-Convex Optimization in Machine Learning. Jan Mrkos AIC

18.6 Regression and Classification with Linear Models

CSC 411 Lecture 6: Linear Regression

OPTIMIZATION METHODS IN DEEP LEARNING

AI Programming CS F-20 Neural Networks

More Optimization. Optimization Methods. Methods

Based on the original slides of Hung-yi Lee

Comparison of Modern Stochastic Optimization Algorithms

ECS289: Scalable Machine Learning

Error Functions & Linear Regression (1)

Data Mining Part 5. Prediction

Neural Nets Supervised learning

Lecture 1: Supervised Learning

CSC321 Lecture 22: Q-Learning

Artificial Neuron (Perceptron)

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Mini-Course 1: SGD Escapes Saddle Points

Accelerating Stochastic Optimization

4. Multilayer Perceptrons

Exploring the energy landscape

Transcription:

CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26

Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture: various things that can go wrong in gradient descent, and what to do about them. Let s take a break from equations and think intuitively. Let s group all the parameters (weights and biases) of our network into a single vector θ. Roger Grosse CSC321 Lecture 8: Optimization 2 / 26

Optimization Visualizing gradient descent in one dimension: w w ɛ de dw The regions where gradient descent converges to a particular local minimum are called basins of attraction. Roger Grosse CSC321 Lecture 8: Optimization 3 / 26

Optimization Visualizing two-dimensional optimization problems is trickier. Surface plots can be hard to interpret: Roger Grosse CSC321 Lecture 8: Optimization 4 / 26

Optimization Recall: Level sets (or contours): sets of points on which E(θ) is constant Gradient: the vector of partial derivatives θ E = E θ = ( E θ 1, E θ 2 points in the direction of maximum increase orthogonal to the level set The gradient descent updates are opposite the gradient direction. ) Roger Grosse CSC321 Lecture 8: Optimization 5 / 26

Optimization Roger Grosse CSC321 Lecture 8: Optimization 6 / 26

Local Minima Recall: convex functions don t have local minima. This includes linear regression and logistic regression. But neural net training is not convex! Reason: if a function f is convex, then for any set of points x 1,..., x N in its domain, f (λ 1 x 1 + +λ N x N ) λ 1 f (x 1 )+ +λ N f (x N ) for λ i 0, i λ i = 1. Neural nets have a weight space symmetry: we can permute all the hidden units in a given layer and obtain an equivalent solution. Suppose we average the parameters for all K! permutations. Then we get a degenerate network where all the hidden units are identical. If the cost function were convex, this solution would have to be better than the original one, which is ridiculous! Even though any multilayer neural net can have local optima, we usually don t worry too much about them. Roger Grosse CSC321 Lecture 8: Optimization 7 / 26

Saddle points At a saddle point E θ = 0, even though we are not at a minimum. Some directions curve upwards, and others curve downwards. When would saddle points be a problem? Roger Grosse CSC321 Lecture 8: Optimization 8 / 26

Saddle points At a saddle point E θ = 0, even though we are not at a minimum. Some directions curve upwards, and others curve downwards. When would saddle points be a problem? If we re exactly on the saddle point, then we re stuck. If we re slightly to the side, then we can get unstuck. Roger Grosse CSC321 Lecture 8: Optimization 8 / 26

Saddle points Suppose you have two hidden units with identical incoming and outgoing weights. After a gradient descent update, they will still have identical weights. By induction, they ll always remain identical. But if you perturbed them slightly, they can start to move apart. Important special case: don t initialize all your weights to zero! Instead, use small random values. Roger Grosse CSC321 Lecture 8: Optimization 9 / 26

Plateaux A flat region is called a plateau. (Plural: plateaux) Can you think of examples? Roger Grosse CSC321 Lecture 8: Optimization 10 / 26

Plateaux A flat region is called a plateau. (Plural: plateaux) Can you think of examples? 0 1 loss hard threshold activations logistic activations & least squares Roger Grosse CSC321 Lecture 8: Optimization 10 / 26

Plateaux An important example of a plateau is a saturated unit. This is when it is in the flat region of its activation function. Recall the backprop equation for the weight derivative: z i = h i φ (z) w ij = z i x j If φ (z i ) is always close to zero, then the weights will get stuck. If there is a ReLU unit whose input z i is always negative, the weight derivatives will be exactly 0. We call this a dead unit. Roger Grosse CSC321 Lecture 8: Optimization 11 / 26

Ravines Long, narrow ravines: Lots of sloshing around the walls, only a small derivative along the slope of the ravine s floor. Roger Grosse CSC321 Lecture 8: Optimization 12 / 26

Ravines Suppose we have the following dataset for linear regression. x 1 x 2 t 114.8 0.00323 5.1 338.1 0.00183 3.2 98.8 0.00279 4.1... w i = y x i Which weight, w 1 or w 2, will receive a larger gradient descent update? Which one do you want to receive a larger update? Note: the figure vastly understates the narrowness of the ravine! Roger Grosse CSC321 Lecture 8: Optimization 13 / 26

Ravines Or consider the following dataset: x 1 x 2 t 1003.2 1005.1 3.3 1001.1 1008.2 4.8 998.3 1003.4 2.9... Roger Grosse CSC321 Lecture 8: Optimization 14 / 26

Ravines To avoid these problems, it s a good idea to center your inputs to zero mean and unit variance, especially when they re in arbitrary units (feet, seconds, etc.). x j = x j µ j σ j Hidden units may have non-centered activations, and this is harder to deal with. One trick: replace logistic units (which range from 0 to 1) with tanh units (which range from -1 to 1) A recent method called batch normalization explicitly centers each hidden activation. It often speeds up training by 1.5-2x, and it s available in all the major neural net frameworks. Roger Grosse CSC321 Lecture 8: Optimization 15 / 26

Momentum Unfortunately, even with these normalization tricks, narrow ravines will be a fact of life. We need algorithms that are able to deal with them. Momentum is a simple and highly effective method. Imagine a hockey puck on a frictionless surface (representing the cost function). It will accumulate momentum in the downhill direction: p µp α E θ θ θ + p α is the learning rate, just like in gradient descent. µ is a damping parameter. It should be slightly less than 1 (e.g. 0.9 or 0.99). Why not exactly 1? Roger Grosse CSC321 Lecture 8: Optimization 16 / 26

Momentum Unfortunately, even with these normalization tricks, narrow ravines will be a fact of life. We need algorithms that are able to deal with them. Momentum is a simple and highly effective method. Imagine a hockey puck on a frictionless surface (representing the cost function). It will accumulate momentum in the downhill direction: p µp α E θ θ θ + p α is the learning rate, just like in gradient descent. µ is a damping parameter. It should be slightly less than 1 (e.g. 0.9 or 0.99). Why not exactly 1? If µ = 1, conservation of energy implies it will never settle down. Roger Grosse CSC321 Lecture 8: Optimization 16 / 26

Momentum In the high curvature directions, the gradients cancel each other out, so momentum dampens the oscillations. In the low curvature directions, the gradients point in the same direction, allowing the parameters to pick up speed. If the gradient is constant (i.e. the cost surface is a plane), the parameters will reach a terminal velocity of α 1 µ E θ This suggests if you increase µ, you should lower α to compensate. Momentum sometimes helps a lot, and almost never hurts. Roger Grosse CSC321 Lecture 8: Optimization 17 / 26

Ravines Even with momentum and normalization tricks, narrow ravines are still one of the biggest obstacles in optimizing neural networks. Empirically, the curvature can be many orders of magnitude larger in some directions than others! An area of research known as second-order optimization develops algorithms which explicitly use curvature information (second derivatives), but these are complicated and difficult to scale to large neural nets and large datasets. There is an optimization procedure called Adam which uses just a little bit of curvature information and often works much better than gradient descent. It s available in all the major neural net frameworks. Roger Grosse CSC321 Lecture 8: Optimization 18 / 26

Learning Rate The learning rate α is a hyperparameter we need to tune. Here are the things that can go wrong in batch mode: α too small: slow progress α too large: oscillations α much too large: instability Good values are typically between 0.001 and 0.1. You should do a grid search if you want good performance (i.e. try 0.1, 0.03, 0.01,...). Roger Grosse CSC321 Lecture 8: Optimization 19 / 26

Training Curves To diagnose optimization problems, it s useful to look at training curves: plot the training cost as a function of iteration. Warning: it s very hard to tell from the training curves whether an optimizer has converged. They can reveal major problems, but they can t guarantee convergence. Roger Grosse CSC321 Lecture 8: Optimization 20 / 26

Stochastic Gradient Descent So far, the cost function E has been the average loss over the training examples: E(θ) = 1 N N L (i) = 1 N i=1 N L(y(x (i), θ), t (i) ). i=1 By linearity, E θ = 1 N N i=1 L (i) θ. Computing the gradient requires summing over all of the training examples. This is known as batch training. Batch training is impractical if you have a large dataset (e.g. millions of training examples)! Roger Grosse CSC321 Lecture 8: Optimization 21 / 26

Stochastic Gradient Descent Stochastic gradient descent (SGD): update the parameters based on the gradient for a single training example: θ θ α L(i) θ SGD can make significant progress before it has even looked at all the data! Mathematical justification: if you sample a training example at random, the stochastic gradient is an unbiased estimate of the batch gradient: [ ] L (i) E = 1 θ N N i=1 L (i) θ = E θ. Problem: if we only look at one training example at a time, we can t exploit efficient vectorized operations. Roger Grosse CSC321 Lecture 8: Optimization 22 / 26

Stochastic Gradient Descent Compromise approach: compute the gradients on a medium-sized set of training examples, called a mini-batch. Each entire pass over the dataset is called an epoch. Stochastic gradients computed on larger mini-batches have smaller variance: [ ] [ 1 S L (i) S ] [ ] Var = 1 S θ j S 2 Var L (i) = 1S θ Var L (i) j θ j i=1 i=1 The mini-batch size S is a hyperparameter that needs to be set. Too large: takes more memory to store the activations, and longer to compute each gradient update Too small: can t exploit vectorization A reasonable value might be S = 100. Roger Grosse CSC321 Lecture 8: Optimization 23 / 26

Stochastic Gradient Descent Batch gradient descent moves directly downhill. SGD takes steps in a noisy direction, but moves downhill on average. batch gradient descent stochastic gradient descent Roger Grosse CSC321 Lecture 8: Optimization 24 / 26

SGD Learning Rate In stochastic training, the learning rate also influences the fluctuations due to the stochasticity of the gradients. Typical strategy: Use a large learning rate early in training so you can get close to the optimum Gradually decay the learning rate to reduce the fluctuations Roger Grosse CSC321 Lecture 8: Optimization 25 / 26

SGD Learning Rate Warning: by reducing the learning rate, you reduce the fluctuations, which can appear to make the loss drop suddenly. But this can come at the expense of long-run performance. Roger Grosse CSC321 Lecture 8: Optimization 26 / 26