Deep Learning. Authors: I. Goodfellow, Y. Bengio, A. Courville. Chapter 4: Numerical Computation. Lecture slides edited by C. Yim.



Chapter 4: Numerical Computation
4.1 Overflow and Underflow
4.2 Poor Conditioning
4.3 Gradient-based Optimization
4.4 Constrained Optimization

Introduction Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula providing a symbolic expression for the correct solution. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers.

4.1 Overflow and Underflow The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. Underflow occurs when numbers near zero are rounded to zero. Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞. One example of a function that must be stabilized against underflow and overflow is the softmax function, defined as softmax(x)_i = exp(x_i) / Σ_j exp(x_j). The softmax function is often used to predict the probabilities associated with a multinoulli distribution.

Overflow and Underflow Consider what happens when all of the x_i are equal to some constant c. If c is very negative, then exp(c) will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When c is very large and positive, exp(c) will overflow, again resulting in the expression as a whole being undefined.

Overflow and Underflow Both of these difficulties can be resolved by instead evaluating softmax(z) where z = x − max_i x_i. Subtracting max_i x_i results in the largest argument to exp being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero.
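The stabilization described above can be sketched in a few lines of pure Python (the helper name `softmax` is our own; the slides do not prescribe an implementation):

```python
import math

def softmax(x):
    # Subtract the max so the largest argument to exp is 0: no overflow,
    # and the denominator contains at least one term equal to 1.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# A naive softmax would fail here: math.exp(1000) raises OverflowError,
# and math.exp(-1000) underflows to 0, giving a 0/0 denominator.
print(softmax([1000.0, 1000.0, 1000.0]))    # each entry ≈ 1/3
print(softmax([-1000.0, -1000.0, -1000.0]))  # each entry ≈ 1/3
```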

4.2 Poor Conditioning Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Consider the function f(x) = A⁻¹x. When A ∈ ℝⁿˣⁿ has an eigenvalue decomposition, its condition number is max_{i,j} |λ_i / λ_j|. This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input.
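As an illustration, the following sketch (our own helper, limited to symmetric 2×2 matrices so the eigenvalues have a closed form) computes the condition number as the eigenvalue ratio:

```python
import math

def eig_sym2(a, b, d):
    # Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]]
    # via the trace/determinant quadratic formula.
    t, det = a + d, a * d - b * b
    disc = math.sqrt(t * t / 4 - det)
    return t / 2 - disc, t / 2 + disc

# A well-conditioned matrix vs. a poorly conditioned one.
for a, b, d in [(2.0, 0.0, 3.0), (1.0, 0.0, 1e6)]:
    lo, hi = eig_sym2(a, b, d)
    kappa = abs(hi) / abs(lo)  # ratio of largest to smallest eigenvalue magnitude
    print("condition number:", kappa)
```

For the second matrix the condition number is about 10⁶, so relative errors in the input to matrix inversion can be amplified by roughly that factor.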

4.3 Gradient-Based Optimization Most deep learning algorithms involve optimization. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. We often denote the value that minimizes or maximizes a function with a superscript *. Ex: x* = arg min_x f(x).

Gradient Descent Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted f'(x) or dy/dx. The derivative gives the slope of f(x) at the point x. It specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + ε f'(x). f(x − ε sign(f'(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative. This technique is called gradient descent.

Gradient Descent Figure 4.1: An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum.
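The one-dimensional update rule above can be sketched directly (a minimal illustration; the function, step size, and step count are our own choices, not from the slides):

```python
def grad_descent_1d(f_prime, x0, lr=0.1, steps=100):
    # Repeatedly move x a small step against the derivative,
    # i.e. x <- x - lr * f'(x).
    x = x0
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

# f(x) = (x - 3)^2 has derivative f'(x) = 2(x - 3) and its minimum at x = 3.
x_min = grad_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # ≈ 3
```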

Critical Points Points where f'(x) = 0 are known as critical points or stationary points. A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points. Some critical points are neither maxima nor minima. These are known as saddle points. A point that obtains the absolute lowest value of f(x) is a global minimum.

Critical Points Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points; a local maximum, which is higher than the neighboring points; or a saddle point, which has neighbors that are both higher and lower than the point itself.

Approximate Optimization Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to sufficiently low values of the cost function.

Gradient Descent For functions with multiple inputs, we must make use of the concept of partial derivatives. The partial derivative ∂f/∂x_i measures how f changes as only the variable x_i increases at point x. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector: the gradient of f is the vector containing all of the partial derivatives, denoted ∇_x f(x). Element i of the gradient is the partial derivative of f with respect to x_i. In multiple dimensions, critical points are points where every element of the gradient is equal to zero.
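Each partial derivative can be estimated numerically, which makes the definition concrete. Here is a minimal central-difference sketch (the helper name and test function are our own, for illustration only):

```python
def numerical_gradient(f, x, h=1e-6):
    # Central-difference estimate of each partial derivative df/dx_i:
    # perturb only coordinate i, holding the others fixed.
    g = []
    for i in range(len(x)):
        up, dn = list(x), list(x)
        up[i] += h
        dn[i] -= h
        g.append((f(up) - f(dn)) / (2 * h))
    return g

# f(x) = x1^2 + 3*x2 has gradient (2*x1, 3).
print(numerical_gradient(lambda x: x[0] ** 2 + 3 * x[1], [2.0, 5.0]))  # ≈ [4, 3]
```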

Gradient Descent The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, the directional derivative is the derivative of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, we can see that ∂/∂α f(x + αu) evaluates to uᵀ ∇_x f(x) when α = 0. To minimize f, we would like to find the direction in which f decreases the fastest.

Gradient Descent min over unit vectors u of uᵀ ∇_x f(x) = min_u ||u||₂ ||∇_x f(x)||₂ cos θ, where θ is the angle between u and the gradient. Substituting in ||u||₂ = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when u points in the opposite direction as the gradient. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.

Steepest Descent Steepest descent proposes a new point x' = x − ε ∇_x f(x), where ε is the learning rate, a positive scalar determining the size of the step. Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero).
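The multivariate update and its stopping condition can be sketched as follows (the gradient function, learning rate, and tolerance are illustrative choices of our own):

```python
def steepest_descent(grad, x, lr=0.05, tol=1e-8, max_steps=10_000):
    # x' = x - lr * grad f(x); stop once every gradient element
    # is (nearly) zero, as the slide's convergence criterion states.
    for _ in range(max_steps):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            break
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# f(x) = x1^2 + 5*x2^2 has gradient (2*x1, 10*x2) and minimum at the origin.
sol = steepest_descent(lambda x: [2 * x[0], 10 * x[1]], [4.0, -2.0])
print(sol)  # ≈ [0, 0]
```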

4.3.1 Beyond the Gradient Sometimes we need all of the partial derivatives of a function whose input and output are both vectors. The Jacobian matrix of f: ℝᵐ → ℝⁿ is the matrix J ∈ ℝⁿˣᵐ defined such that J_{i,j} = ∂f(x)_i / ∂x_j. Second derivative: a derivative of a derivative. The derivative with respect to x_i of the derivative of f with respect to x_j is denoted ∂²f / ∂x_i ∂x_j. In a single dimension, we can denote d²f/dx² by f''(x). We can think of the second derivative as measuring curvature.

Curvature Figure 4.4: The second derivative determines the curvature of a function. Here we show quadratic functions with various curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone as we make a gradient step downhill. In the case of negative curvature, the cost function actually decreases faster than the gradient predicts. In the case of no curvature, the gradient predicts the decrease correctly. In the case of positive curvature, the function decreases slower than expected and eventually begins to increase, so steps that are too large can actually increase the function inadvertently.

Hessian Matrix When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix H(f)(x) is defined such that H(f)(x)_{i,j} = ∂²f(x) / ∂x_i ∂x_j. Equivalently, the Hessian is the Jacobian of the gradient. Anywhere that the second partial derivatives are continuous, the differential operators are commutative: ∂²f / ∂x_i ∂x_j = ∂²f / ∂x_j ∂x_i, so the Hessian matrix is symmetric at such points.
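The symmetry property can be checked numerically. The following sketch (our own helper, using central differences; the test function is illustrative) approximates the Hessian entry by entry:

```python
def hessian_fd(f, x, h=1e-4):
    # Approximate H[i][j] = d^2 f / dx_i dx_j with central differences.
    n = len(x)
    def fp(i, j, si, sj):
        y = list(x)
        y[i] += si * h
        y[j] += sj * h
        return f(y)
    return [[(fp(i, j, 1, 1) - fp(i, j, 1, -1) - fp(i, j, -1, 1) + fp(i, j, -1, -1))
             / (4 * h * h) for j in range(n)] for i in range(n)]

# f(x) = x1^2 * x2 has continuous second partials, so its Hessian is symmetric:
# f_x1x1 = 2*x2, f_x1x2 = f_x2x1 = 2*x1, f_x2x2 = 0.
H = hessian_fd(lambda x: x[0] ** 2 * x[1], [1.0, 2.0])
print(H)  # ≈ [[4, 2], [2, 0]]
```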

Saddle Points Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x₁² − x₂². Along the axis corresponding to x₁, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x₂, the function curves downward. This direction is an eigenvector of the Hessian with a negative eigenvalue. The name "saddle point" derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point.

Gradient Descent and Poor Conditioning Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function f(x) whose Hessian matrix has condition number 5. This means that the direction of most curvature has five times more curvature than the direction of least curvature. In this case, the most curvature is in the direction [1, 1]ᵀ and the least curvature is in the direction [1, −1]ᵀ. The red lines indicate the path followed by gradient descent. This very elongated quadratic function resembles a long canyon. Gradient descent wastes time repeatedly descending canyon walls, because they are the steepest feature.

4.4 Constrained Optimization We may wish to find the maximal or minimal value of f(x) for values of x in some set S. This is known as constrained optimization. Points x that lie within the set S are called feasible points in constrained optimization terminology.

Constrained Optimization We can describe the feasible set S in terms of equality constraints g^(i) and inequality constraints h^(j): S = {x | g^(i)(x) = 0 for all i, and h^(j)(x) ≤ 0 for all j}. The generalized Lagrangian or generalized Lagrange function is then L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x), where the λ_i and α_j are the Lagrange multipliers (KKT multipliers).
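To make the Lagrangian concrete, consider a small worked example of our own (not from the slides): minimize f(x₁, x₂) = x₁² + x₂² subject to the single equality constraint g(x) = x₁ + x₂ − 1 = 0. Setting the gradient of L(x, λ) = f(x) + λ g(x) to zero gives 2x₁ + λ = 0, 2x₂ + λ = 0, and x₁ + x₂ = 1, so x₁ = x₂ = 1/2 with λ = −1. The sketch below confirms this numerically using plain projected gradient descent (a different technique than the Lagrangian itself, named here for what it is):

```python
def projected_gradient_descent(steps=200, lr=0.1):
    # Minimize f(x1, x2) = x1^2 + x2^2 subject to x1 + x2 = 1 by taking a
    # gradient step on f, then projecting back onto the constraint line.
    x1, x2 = 3.0, -2.0  # feasible start: x1 + x2 = 1
    for _ in range(steps):
        x1, x2 = x1 - lr * 2 * x1, x2 - lr * 2 * x2  # gradient step on f
        shift = (x1 + x2 - 1) / 2                    # project onto x1 + x2 = 1
        x1, x2 = x1 - shift, x2 - shift
    return x1, x2

print(projected_gradient_descent())  # ≈ (0.5, 0.5), matching the Lagrangian solution
```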