Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Similar documents
Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on

Optimization for Training I. First-Order Methods Training algorithm

Optimization for neural networks

Machine Learning: Logistic Regression. Lecture 04

Day 3 Lecture 3. Optimizing deep networks

Stochastic Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

Deep Learning II: Momentum & Adaptive Step Size

Deep Feedforward Networks

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

STA141C: Big Data & High Performance Statistical Computing

Deep Learning & Artificial Intelligence WS 2018/2019

Don t Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017

Lecture 6 Optimization for Deep Neural Networks

CSC321 Lecture 8: Optimization

Gradient Descent. Sargur Srihari

CS260: Machine Learning Algorithms

Tips for Deep Learning

ECS171: Machine Learning

Non-Linearity. CS 188: Artificial Intelligence. Non-Linear Separators. Non-Linear Separators. Deep Learning I

ECS289: Scalable Machine Learning

Adam: A Method for Stochastic Optimization

OPTIMIZATION METHODS IN DEEP LEARNING

Large-scale Stochastic Optimization

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Selected Topics in Optimization. Some slides borrowed from

Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

Tips for Deep Learning

Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Adaptive Gradient Methods AdaGrad / Adam

CSC321 Lecture 7: Optimization

CSC 578 Neural Networks and Deep Learning

More Tips for Training Neural Network. Hung-yi Lee

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Big Data Analytics. Lucas Rego Drumond

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Neural Networks: Optimization & Regularization

Regularization and Optimization of Backpropagation

Neural Network Training

ECE521 lecture 4: 19 January Optimization, MLE, regularization

Nonlinear Optimization Methods for Machine Learning

Lecture 1: Supervised Learning

Machine Learning. Lecture 2: Linear regression. Feng Li.

Least Mean Squares Regression. Machine Learning Fall 2018

Regression with Numerical Optimization. Logistic

Stochastic Optimization Algorithms Beyond SG

ECS289: Scalable Machine Learning

IMPROVING STOCHASTIC GRADIENT DESCENT

Optimization II: Unconstrained Multivariable

Why should you care about the solution strategies?

Second order machine learning

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

Linear Regression Introduction to Machine Learning. Matt Gormley Lecture 4 September 19, Readings: Bishop, 3.

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

Accelerating Stochastic Optimization

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Second order machine learning

EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Optimization Methods for Machine Learning

Notes on AdaGrad. Joseph Perla 2014

Optimization and Gradient Descent

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang

Neural Networks: Optimization Part 2. Intro to Deep Learning, Fall 2017

Linear Regression (continued)

Mini-Course 1: SGD Escapes Saddle Points

Numerical Optimization

Lecture 3: Deep Learning Optimizations

Normalization Techniques in Training of Deep Neural Networks

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Stochastic Gradient Descent (SGD) The Classical Convergence Thoerem

Non-Convex Optimization in Machine Learning. Jan Mrkos AIC

CSC2541 Lecture 5 Natural Gradient

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Linear Regression. S. Sumitra

More Optimization. Optimization Methods. Methods

Based on the original slides of Hung-yi Lee

Least Mean Squares Regression

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

CPSC 540: Machine Learning

ECS171: Machine Learning

Numerical Optimization: Basic Concepts and Algorithms

CS60021: Scalable Data Mining. Large Scale Machine Learning

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Accelerating SVRG via second-order information

Numerical Optimization Techniques

y(x n, w) t n 2. (1)

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

Optimization Methods

Optimization for Machine Learning

Coordinate Descent and Ascent Methods

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

SVRG++ with Non-uniform Sampling

Transcription:

Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu

Machine Learning is Optimization Parametric ML involves minimizing an objective function J(w): Also called cost function, loss function, or error function. Want to find w" = argmin w J(w) Numerical optimization procedure: 1. Start with some guess for w 0, set τ = 0. 2. Update w τ to w τ+1 such that J(w τ+1 ) J(w τ ). 3. Increment τ = τ + 1. 4. Repeat from 2 until J cannot be improved anymore. 2

Gradient-based Optimization How to update w τ to w τ+1 such that J(w τ+1 ) J(w τ )? Move w in the direction of steepest descent: w /01 = w / + ηg g is the direction of steepest descent, i.e. direction along which J decreases the most. η is the learning rate and controls the magnitude of the change. 3

Gradient-based Optimization Move w in the direction of steepest descent: w /01 = w / + ηg What is the direction of steepest descent of J(w) at w τ? The gradient J(w) is in the direction of steepest ascent. Set g = J(w) => the gradient descent update: w /01 = w / η J(w / ) 4

Gradient Descent Algorithm Want to minimize a function J : R n R. J is differentiable and convex. compute gradient of J i.e. direction of steepest increase: J w = J J J,,, w 1 w ; w = 1. Set learning rate η = 0.001 (or other small value). 2. Start with some guess for w 0, set τ = 0. 3. Repeat for epochs E or until J does not improve: 4. τ = τ + 1. 5. w /01 = w / η J w / 5

Gradient Descent: Large Updates 6

Gradient Descent: Small Updates https://www.safaribooksonline.com/library/view/hands-on-machine-learning 7

The Learning Rate 1. Set learning rate η = 0.001 (or other small value). 2. Start with some guess for w 0, set τ = 0. 3. Repeat for epochs E or until J does not improve: 4. τ = τ + 1. 5. w /01 = w / η J w / How big should the learning rate be? o o If learning rate too small => slow convergence. If learning rate too big => oscillating behavior => may not even converge. 8

Learning Rate too Small 9

Learning Rate too Large 10

The Learning Rate How big should the learning rate be? If learning rate too big => oscillating behavior. If learning rate too small => hinders convergence. o Use line search (backtracking line search, conjugate gradient, ). o Use second order methods (Newton s method, L-BFGS,...). o o Requires computing or estimating the Hessian. Use a simple learning rate annealing schedule: Start with a relatively large value for the learning rate. Decrease the learning rate as a function of the number of epochs or as a function of the improvement in the objective. Use adaptive learning rates: Adagrad, Adadelta, RMSProp, Adam. 11

Gradient Descent: Nonconvex Objective Saddle point 12

Convex Multivariate Objective w 0 w 1 13

Gradient Step and Contour Lines w 0 w 1 14

Gradient Descent: Nonconvex Objectives 15

Gradient Descent & Plateaus 16

Gradient Descent & Saddle Points 17

Gradient Descent & Ravines 18

Gradient Descent & Ravines Ravines are areas where the surface curves much more steeply in one dimension than another. Common around local optima. GD oscillates across the slopes of the ravines, making slow progress towards the local optimum along the bottom. Use momentum to help accelerate GD in the relevant directions and dampen oscillations: Add a fraction of the past update vector to the current update vector. The momentum term increases for dimensions whose previous gradients point in the same direction. It reduces updates for dimensions whose gradients change sign. Also reduces the risk of getting stuck in local minima. 19

Gradient Descent & Momentum Vanilla Gradient Descent: v /01 = η J(w / ) w /01 = w / v /01 Gradient Descent w/ Momentum: v /01 = γv / + η J(w / ) w /01 = w / v /01 γ is usually set to 0.9 or similar. 20

Momentum & Nesterov Accelerated Gradient GD with Momentum: v /01 = γv / + η J(w / ) w /01 = w / v /01 Nesterov Accelerated Gradient: v /01 = γv / + η J(w / γv / ) w /01 = w / v /01 By making an anticipatory update, NAGs prevents GD from going too fast => significant improvements when training RNNs. 21

Gradient Descent Optimization Algorithms Momentum. Nesterov Accelerated Gradient (NAG). Adaptive learning rates methods: Idea is to perform larger updates for infrequent params and smaller updates for frequent params, by accumulating previous gradient values for each parameter. Adagrad. Adadelta. RMSProp. Adaptive Moment Estimation (Adam) 22

Visualization Adagrad, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Insofar, Adam might be the best overall choice. 23

Variants of Gradient Descent w /01 = w / η J w / Depeding on how much data is used to compute the gradient at each step: Batch gradient descent: Use all the training examples. Stochastic gradient descent (SGD). Use one training example, update after each. Minibatch gradient descent. Use a constant number of training examples (minibatch). 24

Batch Gradient Descent Sum-of-squares error: H J w = 1 2N C h w(x (=) ; ) t n =I1 w /01 = w / η J w / H w /01 = w / η 1 N C =I1 h w (x (=) ) t n x (=) 25

Stochastic Gradient Descent Sum-of-squares error: H J w = 1 2N C h w(x (=) ; ) t n =I1 H = 1 2N C J w/, x (=) =I1 w /01 = w / η J w /, x (=) w /01 = w / η h w (x (=) ) t n x (=) Update parameters w after each example, sequentially: => the least-mean-square (LMS) algorithm. 26

Batch GD vs. Stochastic GD Accuracy: Time complexity: Memory complexity: Online learning: 27

Batch GD vs. Stochastic GD 28

Pre-processing Features Features may have very different scales, e.g. x 1 = rooms vs. x 2 = size in sq ft. Right (different scales): GD goes first towards the bottom of the bowl, then slowly along an almost flat valley. Left (scaled features): GD goes straight towards the minimum. 29

Feature Scaling Scaling between [0, 1] or [ 1, +1]: For each feature x j, compute min j and max j over the training examples. Scale x (n) j as follows: Scaling to standard normal distribution: For each feature x j, compute sample μ j and sample σ j over the training examples. Scale x (n) j as follows: 30

Implementation: Gradient Checking Want to minimize J(θ), where θ is a scalar. Mathematical definition of derivative: d J(θ +ε) J(θ ε) J(θ) = lim dθ ε 2ε Numerical approximation of derivative: d J(θ +ε) J(θ ε) J(θ) dθ 2ε where ε = 0.0001 31

Implementation: Gradient Checking If θ is a vector of parameters θ i, Compute numerical derivative with respect to each θ i. Aggregate all derivatives into numerical gradient G num (θ). Compare numerical gradient G num (θ) with implementation of gradient G imp (θ): G num (θ) G imp (θ) G num (θ)+ G imp (θ) 10 6 32

Gradient Descent vs. Normal Equations Gradient Descent: Need to select learning rate η. May need many iterations: Can do Early Stopping on validation data for regularization. Scalable when number of training examples N is large. Normal Equations: No iterations => easy to code. Computing (X M X) -1 has cubic time complexity => slow for large N. X M X may be singular: 1. Redundant (linearly dependent) features. 2. #features > #examples => do feature selection or regularization. 33