Gradient Descent. Sargur Srihari

Similar documents
Gradient Descent. Dr. Xiaowei Huang

Neural Network Training

Introduction to gradient descent

Deep Learning. Authors: I. Goodfellow, Y. Bengio, A. Courville. Chapter 4: Numerical Computation. Lecture slides edited by C. Yim. C.

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Linear Models for Regression. Sargur Srihari

Optimization and Gradient Descent

Learning Parameters of Undirected Models. Sargur Srihari

Iterative Reweighted Least Squares

Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

LECTURE 22: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Unconstrained optimization

Why should you care about the solution strategies?

ECS171: Machine Learning

ECE521 lecture 4: 19 January Optimization, MLE, regularization

Learning Parameters of Undirected Models. Sargur Srihari

Least Mean Squares Regression. Machine Learning Fall 2018

CSC321 Lecture 8: Optimization

Optimization for neural networks

Linear Regression (continued)

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Linear Models for Regression

Multiclass Logistic Regression

Machine Learning and Data Mining. Linear regression. Kalev Kask

1 Numerical optimization

Lecture 17: Numerical Optimization October 2014

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

1 Numerical optimization

Gradient-Based Learning. Sargur N. Srihari

LOCAL SEARCH. Today. Reading AIMA Chapter , Goals Local search algorithms. Introduce adversarial search 1/31/14

OPTIMIZATION METHODS IN DEEP LEARNING

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Deep Learning & Artificial Intelligence WS 2018/2019

Machine Learning Basics III

Lecture 7: Minimization or maximization of functions (Recipes Chapter 10)

Linear Models for Regression

Day 3 Lecture 3. Optimizing deep networks

Lecture Notes: Geometric Considerations in Unconstrained Optimization

Least Mean Squares Regression

Logistic Regression. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, / 48

Linear Models for Regression CS534

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSC321 Lecture 7: Optimization

Performance Surfaces and Optimum Points

Reading Group on Deep Learning Session 1

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

ECS171: Machine Learning

Learning with Momentum, Conjugate Gradient Learning

Linear discriminant functions

CSC2541 Lecture 5 Natural Gradient

Lecture 1: Supervised Learning

Regularization in Neural Networks

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Stochastic Analogues to Deterministic Optimizers

Motivation: We have already seen an example of a system of nonlinear equations when we studied Gaussian integration (p.8 of integration notes)

Non-Convex Optimization in Machine Learning. Jan Mrkos AIC

Linear classifiers: Overfitting and regularization

MATHEMATICS FOR COMPUTER VISION WEEK 8 OPTIMISATION PART 2. Dr Fabio Cuzzolin MSc in Computer Vision Oxford Brookes University Year

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Logistic Regression Logistic

Optimization Tutorial 1. Basic Gradient Descent

Machine Learning and Data Mining. Linear regression. Prof. Alexander Ihler

Selected Topics in Optimization. Some slides borrowed from

Linear Models for Regression CS534

Stochastic Gradient Descent

The Steepest Descent Algorithm for Unconstrained Optimization

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

ECE 680 Modern Automatic Control. Gradient and Newton s Methods A Review

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Logistic Regression. Neural Networks

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

Lecture 4: Perceptrons and Multilayer Perceptrons

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Classification with Perceptrons. Reading:

Gaussian and Linear Discriminant Analysis; Multiclass Classification

More Optimization. Optimization Methods. Methods

Machine Learning Basics: Maximum Likelihood Estimation

Review of Classical Optimization

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Optimization methods

Nonlinear Optimization

Chapter 6: Derivative-Based. optimization 1

Logistic Regression. Seungjin Choi

Bayesian Linear Regression. Sargur Srihari

Lecture 3 - Linear and Logistic Regression

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Classification Logistic Regression

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Neural Nets Supervised learning

Variational Inference and Learning. Sargur N. Srihari

Stochastic Gradient Descent. CS 584: Big Data Analytics

ECS289: Scalable Machine Learning

Learning From Data Lecture 9 Logistic Regression and Gradient Descent

Transcription:

Gradient Descent Sargur srihari@cedar.buffalo.edu 1

Topics Simple Gradient Descent/Ascent Difficulties with Simple Gradient Descent Line Search Brent s Method Conjugate Gradient Descent Weight vectors to minimize error Stochastic Gradient Descent 2

Gradient-Based Optimization Most ML algorithms involve optimization Minimize/maximize a function f (x) by altering x Usually stated a minimization Maximization accomplished by minimizing f (x) f (x) referred to as objective or criterion In minimization also referred to as loss function cost, or error Example is linear least squares f (x) = 1 Ax b 2 2 Denote optimum value by x*=argmin f (x) 3

Calculus in Optimization Consider function y=f (x), x, y real nos. Derivative of function denoted: f (x) or as dy/dx Derivative f (x) gives the slope of f (x) at point x It specifies how to scale a small change in input to obtain a corresponding change in the output: f (x + ε) f (x) + ε f (x) It tells how you make a small change in input to make a small improvement in y We know that f (x - ε sign (f (x))) is less than f (x) for small ε. Thus we can reduce f (x) by moving x in small steps with opposite sign of derivative This technique is called gradient descent (Cauchy 1847)

Gradient Descent Illustrated Given function is f (x)=½ x 2 which has a bowl shape with global minimum at x=0 Since f (x)=x For x>0, f(x) increases with x and f (x)>0 For x<0, f(x) decreases with x and f (x)<0 Use f (x) to follow function downhill Reduce f (x) by going in direction opposite sign of derivative f (x) 5

Minimizing with multiple inputs We often minimize functions with multiple inputs: f: R n àr! For minimization to make sense there must still be only one (scalar) output 6

Application in ML: Minimize Error Negated gradient Direction in w 0 -w 1 plane producing steepest descent Error surface shows desirability of every weight vector, a parabola with a single global minimum 7 Gradient descent search determines a weight vector w that minimizes E(w) by Starting with an arbitrary initial weight vector Repeatedly modifying it in small steps At each step, weight vector is modified in the direction that produces the steepest descent along the error surface

Definition of Gradient Vector The Gradient (derivative) of E with respect to each component of the vector w E[w] E[w!" ] E, E,# E w 0 w 1 w n Notice is a vector of partial derivatives Specifies the direction that produces steepest increase in E Negative of this vector specifies direction of steepest decrease 8

Directional Derivative Directional derivative in direction u (a unit vector) is the slope of function f in direction u u T x f ( x) This evaluates to To minimize f find direction in which f decreases the fastest ( ) = min Do this using min u T u,u T x f x u=1 u,u T u=1 where θ is angle between u and the gradient Substitute u 2 =1 and ignore factors that not depend on u this simplifies to min u cosθ This is minimized when u points in direction opposite to gradient In other words, the gradient points directly uphill, and 9 the negative gradient points directly downhill u 2 x f ( x) cosθ 2

10 where η Gradient Descent Rule is a positive constant called the learning rate Determines step size of gradient descent search Component Form of Gradient Descent Can also be written as where w! w! + Δw Δw = η E[w] w i Δw i w i E = η w + Δw i i

Method of Gradient Descent The gradient points directly uphill, and the negative gradient points directly downhill Thus we can decrease f by moving in the direction of the negative gradient This is known as the method of steepest descent or gradient descent Steepest descent proposes a new point x' = x η x f ( x) where ε is the learning rate, a positive scalar. Set to a small constant. 11

Simple Gradient Descent Procedure Gradient-Descent ( θ 1 f δ ) //Initial starting point //Function to be minimized //Convergence threshold 1 t 1 2 do ( ) 3 θ t+1 θ t η f θ t 4 t t +1 5 while θ t θ t 1 > δ ( ) 6 return θ t Intuition Taylor s expansion of function f(θ) in the neighborhood of θ t is Let θ=θ t+1 =θ t +h, thus f (θ t+1 ) f (θ t )+ h f (θ t ) Derivative of f(θ t+1 ) wrt h is f (θ t ) At h = f (θ t ) a maximum occurs (since h 2 is positive) and at h = f (θ t ) a minimum occurs. Alternatively, The slope f (θ t ) points to the direction of steepest ascent. If we take a step η in the opposite direction we decrease the value of f f (θ) f (θ t )+(θ θ t ) T f (θ t ) One-dimensional example Let f(θ)=θ 2 This function has minimum at θ=0 which we want to determine using gradient descent We have f (θ)=2θ For gradient descent, we update by f (θ) If θ t > 0 then θ t+1 <θ t If θ t <0 then f (θ t )=2θ t is negative, thus θ t+1 >θ t

Ex: Gradient Descent on Least Squares Criterion to minimize Least squares regression The gradient is Gradient Descent algorithm is 1. Set step size ε, tolerance δ to small, positive nos. 2. while do 3. end while f (x) = 1 Ax b 2 2 x f ( x) = A T ( Ax b) = A T Ax A T b A T Ax A T b 2 > δ ( ) x x η A T Ax A T b E D N 1 ( w) = 2 n= 1 { } 2 T t w φ( x ) n n 13

Stationary points, Local Optima When f (x)=0 derivative provides no information about direction of move Points where f (x)=0 are known as stationary or critical points Local minimum/maximum: a point where f(x) lower/ higher than all its neighbors Saddle Points: neither maxima nor minima 14

Presence of multiple minima Optimization algorithms may fail to find global minimum Generally accept such solutions 15

Difficulties with Simple Gradient Descent Performance depends on choice of learning rate η Large learning rate Overshoots Small learning rate Extremely slow Solution ( ) θ t+1 θ t η f θ t Start with large η and settle on optimal value Need a schedule for shrinking η 16

Convergence of Steepest Descent Steepest descent converges when every element of the gradient is zero In practice, very close to zero We may be able to avoid iterative algorithm and jump to the critical point by solving the equation x f ( x) = 0 for x 17

Choosing η: Line Search We can choose η in several different ways Popular approach: set η to a small constant Another approach is called line search: Evaluate f (x η x f ( x) for several values of η and choose the one that results in smallest objective function value 18

Line Search (with ascent) In Gradient Ascent, we increase f by moving in the direction of the gradient Intuition of Line Search: Adaptive choice of step size η at each step Choose direction to ascend and continue in direction until we start to descend Define line in direction of gradient g η ( ) θ t+1 θ t + η f θ t ( ) ( ) =! θ t + η f θ t Note that this is a linear function of η and hence it is a line

Line Search: determining η Given g ( η) = θ! t + η f ( θ t ) Find three points η 1 <η 2 <η 3 so that f (g(η 2 )) is larger than at both f (g(η 1 )) and f (g(η 3 )) We say that η 1 <η 2 <η 3 bracket a maximum If we find an η so that we can find a new tighter bracket η 1 <η <η 2 - To find η use binary search Choose η =(η 1 +η 3 )/2 Method ensures that new bracket is half of the old one

Brent s Method Faster than line search Use quadratic function Based on values of f at η 1, η 2, η 3 Solid line is f (g(η)) Three points bracket its maximum Dashed line shows quadratic fit

Conjugate Gradient Descent In simple gradient ascent: two consecutive gradients are orthogonal: is orthogonal to f (θ t+1 ) f (θ t ) Progress is zig-zag: progress to maximum slows down Quadratic: f(x,y)=-(x 2 +10y 2 ) Exponential: f(x,y)=exp [-(x 2 +10y 2 )] Solution: Remember past directions and bias new direction h t as combination of current g t and past h t-1 Faster convergence than simple gradient ascent

Sequential (On-line) Learning Maximum likelihood solution is w ML = (Φ T Φ) 1 Φ T t It is a batch technique Processing entire training set in one go It is computationally expensive for large data sets Due to huge N x M Design matrix Φ Solution is to use a sequential algorithm where samples are presented one at a time Called Stochastic gradient descent 23

Stochastic Gradient Descent Error function sums over data E D (w) = 1 N { t n w T ϕ(x n ) } 2 n=1 Denoting E D (w)=σ n E n update parameter vector w using ( τ + 1) ( τ ) w = w η E n Substituting for the derivative w (τ +1) = w (τ ) + η(t n w (τ ) φ(x n ))φ(x n ) w is initialized to some starting vector w (0) η chosen with care to ensure convergence Known as Least Mean Squares Algorithm 2 24

SGD Performance Most practitioners use SGD for DL Consists of showing the input vector for a few examples, computing the outputs and the errors, Computing the average gradient for those examples, and adjusting the weights accordingly. Process repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. Called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. Usually finds a good set of weights quickly compared to elaborate optimization techniques. After training, performance is measured on a different test set: tests generalization ability of the machine its ability to produce sensible answers on new inputs 25 never seen during training. Flucutations in objective as gradient steps are taken in mini batches