Introduction to gradient descent

6-1: Introduction to gradient descent. Prof. J.C. Kao, UCLA.

Topics: introduction to gradient descent; derivation and intuitions; the Hessian.

6-2: Introduction

Our goal in machine learning is to optimize an objective function, f(x). (Without loss of generality, we'll consider minimizing f(x); this is equivalent to maximizing -f(x).)

From basic calculus, we recall that the derivative of a function, df(x)/dx, tells us the slope of f(x) at the point x. For small enough ε,

f(x + ε) ≈ f(x) + ε f'(x)

This tells us how to reduce (or increase) f(·) for small enough steps.

Recall that when f'(x) = 0, we are at a stationary point or critical point. This may be a local or global minimum, a local or global maximum, or a saddle point of the function.

In this class we will consider cases where we would like to optimize f w.r.t. vectors and matrices, e.g., f(x) for a vector x or f(X) for a matrix X. Further, f(·) often contains a nonlinearity or a non-differentiable function. In these cases, we can't simply set f'(·) = 0, because this does not admit a closed-form solution. However, we can iteratively approach a critical point via gradient descent.
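The first-order approximation above is easy to check numerically. Below is a minimal sketch using the hypothetical example f(x) = x², which is not from the slides:

```python
# Numerical check of f(x + eps) ~ f(x) + eps * f'(x) for f(x) = x^2.
def f(x):
    return x ** 2

def df(x):
    return 2 * x  # analytic derivative of x^2

x, eps = 1.0, 1e-4
approx = f(x) + eps * df(x)   # first-order Taylor approximation
exact = f(x + eps)            # true value

# For smooth f, the error shrinks like eps^2 as eps -> 0.
print(abs(exact - approx))
```

Halving ε roughly quarters the error, which is the signature of a first-order approximation with a smooth remainder.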

6-3: Terminology

A global minimum is the point, x_g, that achieves the absolute lowest value of f(x), i.e., f(x) ≥ f(x_g) for all x.

A local minimum is a point, x_l, that is a critical point of f(x) and is lower than its neighboring points. However, f(x_l) > f(x_g).

Analogous definitions hold for the global maximum and local maximum.

A saddle point is a critical point of f(x) that is not a local maximum or minimum. Concretely, some neighboring points are greater than f(x) and others are less than f(x).

6-4: Gradient

Recall the gradient, ∇x f(x), is a vector whose ith element is the partial derivative of f(x) w.r.t. x_i, the ith element of x. Concretely, for x ∈ R^n,

∇x f(x) = [∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n]^T

The gradient tells us how a small change in x affects f(x) through

f(x + Δx) ≈ f(x) + Δx^T ∇x f(x)

The directional derivative of f(x) in the direction of the unit vector u is given by

u^T ∇x f(x)

The directional derivative tells us the slope of f in the direction u.
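As a concrete sketch, here is the gradient and a directional derivative for the hypothetical quadratic f(x) = x₁² + 3x₂² (an example of ours, not from the slides):

```python
# Gradient of f(x) = x1^2 + 3*x2^2 and the directional derivative
# u^T grad f(x) along a unit vector u.
def grad_f(x):
    # gradient of x1^2 + 3*x2^2 is [2*x1, 6*x2]
    return [2 * x[0], 6 * x[1]]

x = [1.0, 2.0]
g = grad_f(x)  # [2.0, 12.0]

# Directional derivative along the unit vector u = [1, 0]:
# the slope of f at x when moving in the x1 direction.
u = [1.0, 0.0]
dir_deriv = sum(ui * gi for ui, gi in zip(u, g))  # = 2.0
print(g, dir_deriv)
```

Choosing u aligned with a coordinate axis recovers the ordinary partial derivative along that axis, as expected.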

6-5: Arriving at gradient descent

To minimize f(x), we want to find the direction in which f(x) decreases the fastest. To do so, we find the direction u which minimizes the directional derivative:

min_{u, ||u||=1} u^T ∇x f(x) = min_{u, ||u||=1} ||u|| ||∇x f(x)|| cos θ
                             = min_{u, ||u||=1} ||∇x f(x)|| cos θ

where θ is the angle between the vectors u and ∇x f(x). This quantity is minimized when u points in the opposite direction of the gradient, so that cos θ = −1. Hence, we arrive at gradient descent. To update x so as to minimize f(x), we repeatedly calculate:

x := x − ε ∇x f(x)

ε is typically called the learning rate. It can change over iterations. Setting the value of ε appropriately is an important part of deep learning.
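The update rule above can be sketched in a few lines. This is a minimal illustration on the hypothetical quadratic f(x) = x₁² + 3x₂² (our example, not from the slides), whose unique minimum is the origin:

```python
# Gradient descent x := x - eps * grad f(x) on f(x) = x1^2 + 3*x2^2.
def grad_f(x):
    return [2 * x[0], 6 * x[1]]

x = [1.0, 2.0]
eps = 0.1  # learning rate (a hand-picked constant for this example)
for _ in range(200):
    g = grad_f(x)
    x = [xi - eps * gi for xi, gi in zip(x, g)]  # descent step

# x converges toward the minimizer [0, 0]
print(x)
```

With this ε, each coordinate is contracted by a constant factor per step; too large an ε would instead make the iterates diverge, which is why the learning rate matters.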

6-6: Hessian

Recall that Newton's method determines the step size by considering the curvature of the function, which is the second derivative. The generalization of the second derivative of a scalar-valued function, f(x), is the Hessian.

The Hessian, H, is a matrix whose (i, j)th element is the second derivative of f(x) w.r.t. x_i and x_j. That is, the (i, j)th element of H, denoted H_ij, is

H_ij = ∂²f(x) / (∂x_i ∂x_j)

The matrix H ∈ R^{n×n} is:

H = [ ∂²f/∂x_1∂x_1   ∂²f/∂x_1∂x_2   ...   ∂²f/∂x_1∂x_n ]
    [ ∂²f/∂x_2∂x_1   ∂²f/∂x_2∂x_2   ...   ∂²f/∂x_2∂x_n ]
    [      ...            ...        ...        ...     ]
    [ ∂²f/∂x_n∂x_1   ∂²f/∂x_n∂x_2   ...   ∂²f/∂x_n∂x_n ]
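A sketch of computing the Hessian entry by entry, using central finite differences on the gradient of our running hypothetical quadratic f(x) = x₁² + 3x₂² (for which the exact Hessian is diag(2, 6)):

```python
# Approximate H_ij = d^2 f / (dx_i dx_j) by differencing the gradient.
def grad_f(x):
    # gradient of f(x) = x1^2 + 3*x2^2
    return [2 * x[0], 6 * x[1]]

def hessian(grad, x, h=1e-5):
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp = list(x); xp[j] += h   # perturb x_j up
        xm = list(x); xm[j] -= h   # perturb x_j down
        gp, gm = grad(xp), grad(xm)
        for i in range(n):
            # column j of H: change in grad per unit change in x_j
            H[i][j] = (gp[i] - gm[i]) / (2 * h)
    return H

H = hessian(grad_f, [1.0, 2.0])
print(H)  # close to [[2, 0], [0, 6]]
```

Note the recovered matrix is symmetric, consistent with the continuity of the second partials of this f.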

6-7: Hessian (cont.)

A few notes on the Hessian:

Recall from the mathematical tools notes that we can denote the Hessian as ∇x² f(x).

If the second partial derivatives are continuous, then the order of differentiation is interchangeable (∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i), and the Hessian is thus symmetric.

6-8: Directional second derivative

The directional second derivative along a unit vector u is given by u^T H u. Since H is in general symmetric, it has real eigenvalues and an orthogonal basis of eigenvectors. If u points in the direction of an eigenvector of H, the directional second derivative is the corresponding eigenvalue, λ. The maximum directional second derivative is λ_max and the minimum is λ_min.
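A small sketch of this fact: for a symmetric H, the quadratic form u^T H u evaluated at an eigenvector equals the corresponding eigenvalue. H here is a hypothetical 2x2 example with eigenvalues 2 and 6, not taken from the slides:

```python
# Directional second derivative u^T H u along eigenvectors of H.
H = [[2.0, 0.0], [0.0, 6.0]]

def quad_form(H, u):
    # compute u^T H u
    Hu = [sum(H[i][j] * u[j] for j in range(len(u))) for i in range(len(H))]
    return sum(ui * hui for ui, hui in zip(u, Hu))

u1 = [1.0, 0.0]  # eigenvector with eigenvalue 2 (the minimum curvature)
u2 = [0.0, 1.0]  # eigenvector with eigenvalue 6 (the maximum curvature)
print(quad_form(H, u1), quad_form(H, u2))  # 2.0 6.0
```

Any other unit direction gives a value between λ_min = 2 and λ_max = 6.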

6-9: The second derivative

The second derivative of f(x) is a measure of the curvature of the function at x.

If H is positive definite at a critical point x, then x is a local minimum. This is because the directional second derivative is positive, and so in every direction, the function curves upwards.

If H is negative definite at a critical point x, then x is a local maximum.

If H has both positive and negative eigenvalues at a critical point x, then x is a saddle point.

Intuitively, if f(x) is relatively flat around x, we would like to take large steps along the gradient; if it is very curved, we would like to take small steps. Incorporating the Hessian ought to enable taking better steps in each dimension. This is called Newton's method.
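The classification rules above can be written down directly. As a hypothetical example (not from the slides), for f(x) = x₁² − x₂² the origin is a critical point and the Hessian there is diag(2, −2), so the eigenvalue test reports a saddle point:

```python
# Classify a critical point from the Hessian's eigenvalues.
eigenvalues = [2.0, -2.0]  # Hessian eigenvalues of x1^2 - x2^2 at the origin

if all(ev > 0 for ev in eigenvalues):
    kind = "local minimum"       # positive definite
elif all(ev < 0 for ev in eigenvalues):
    kind = "local maximum"       # negative definite
elif any(ev > 0 for ev in eigenvalues) and any(ev < 0 for ev in eigenvalues):
    kind = "saddle point"        # indefinite
else:
    kind = "inconclusive"        # some eigenvalues are zero

print(kind)  # saddle point
```

When some eigenvalues are exactly zero the test is inconclusive, and higher-order information is needed.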

6-10: Newton's method

We seek to find the step Δx that minimizes the second-order Taylor expansion of f(x + Δx):

f(x + Δx) ≈ f(x) + ∇x f(x)^T Δx + (1/2) Δx^T H Δx

Taking the derivative w.r.t. Δx and setting it equal to zero, we arrive at:

∇x f(x) + H Δx = 0

Hence, we arrive at the optimal Newton step:

Δx = −H^{-1} ∇x f(x)

This method of optimization is called a second-order optimization algorithm because it uses the second derivative.
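A minimal sketch of one Newton step, on the hypothetical quadratic f(x) = x₁² + 2x₂² (our example, not from the slides). Because a quadratic equals its own second-order Taylor expansion, a single Newton step lands exactly on the minimizer:

```python
# One Newton step: x := x - H^{-1} grad f(x) on f(x) = x1^2 + 2*x2^2.
def grad_f(x):
    return [2 * x[0], 4 * x[1]]

# The Hessian of this f is constant: diag(2, 4), so H^{-1} = diag(1/2, 1/4).
H_inv = [[0.5, 0.0], [0.0, 0.25]]

x = [1.0, 2.0]
g = grad_f(x)                                               # [2.0, 8.0]
step = [sum(H_inv[i][j] * g[j] for j in range(2)) for i in range(2)]
x = [xi - si for xi, si in zip(x, step)]                    # Newton update
print(x)  # [0.0, 0.0], the minimizer, in one step
```

Gradient descent on the same function needs many iterations; Newton's advantage here comes from rescaling each direction by its curvature.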

6-11: Limitations of Newton's method

While Newton's method may converge far more quickly than gradient descent, it may also converge to saddle points. Storing the Hessian can be expensive, and inverting the Hessian can be even more expensive. Typically deep learning does not use Newton's method, although there are second-order techniques that don't require computing and inverting the Hessian. We'll talk about this more in the optimization for neural networks slides.

6-12: Lipschitz continuity

In deep learning, we typically deal with functions that are Lipschitz continuous. A function f is Lipschitz continuous if, for a Lipschitz constant l,

|f(x) − f(y)| ≤ l ||x − y||

for all x and y. This condition means that a small change made to the input by gradient descent will have a small effect on the output.
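A small sketch of probing the Lipschitz condition numerically, using the hypothetical example f(x) = sin(x) (not from the slides). Since its derivative cos(x) is bounded by 1 in magnitude, l = 1 is a valid Lipschitz constant, and random sampling never finds a violation:

```python
# Randomly sample pairs (x, y) and check |f(x) - f(y)| <= l * |x - y|.
import math
import random

random.seed(0)
l = 1.0  # candidate Lipschitz constant for sin
ok = True
for _ in range(1000):
    x = random.uniform(-10, 10)
    y = random.uniform(-10, 10)
    # small tolerance guards against floating-point rounding
    if abs(math.sin(x) - math.sin(y)) > l * abs(x - y) + 1e-12:
        ok = False

print(ok)  # True: no sampled pair violates the bound
```

Sampling can only fail to refute a candidate l, not prove it; proving the bound requires bounding the derivative, as the mean value theorem does for sin.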