Generalized Gradient Descent Algorithms


ECE 275A Lecture 11, Fall 2008, V1.1, © K. Kreutz-Delgado, UC San Diego: Generalized Gradient Descent Algorithms

Continuous Dynamic Minimization of $l(x)$

Let $l(\cdot) : \mathcal{X} = \mathbb{R}^n \to \mathbb{R}$ be twice differentiable with respect to $x \in \mathbb{R}^n$ and bounded from below by 0: $l(x) \ge 0$ for all $x$. Recalling that $\dot{l}(x) = \frac{\partial l(x)}{\partial x}\,\dot{x}$, set

$$\dot{x} = -Q(x)\,\nabla^c_x l(x)$$

with $\nabla^c_x l(x) = \left(\frac{\partial l(x)}{\partial x}\right)^T$ the Cartesian gradient of $l(x)$, and $Q(x) = Q(x)^T > 0$ for all $x$. This yields

$$\dot{l}(x) = -\left\|\nabla^c_x l(x)\right\|^2_{Q(x)} \le 0, \qquad \text{with } \dot{l}(x) = 0 \text{ if and only if } \nabla^c_x l(x) = 0.$$

Thus continuously evolving the value of $x$ according to $\dot{x} = -Q(x)\,\nabla^c_x l(x)$ ensures that the cost $l(x)$ dynamically decreases in value until a stationary point turns off learning.

Generalized Gradient Descent Algorithm

A family of algorithms for discrete-step dynamic minimization of $l(x)$ can be developed as follows. Setting $x_k = x(t_k)$, approximate $\dot{x}_k = \dot{x}(t_k)$ by the first-order forward difference

$$\dot{x}_k \approx \frac{x_{k+1} - x_k}{t_{k+1} - t_k}$$

This yields

$$x_{k+1} \approx x_k + \alpha_k\,\dot{x}_k, \qquad \alpha_k = t_{k+1} - t_k,$$

which suggests the Generalized Gradient Descent Algorithm

$$\hat{x}_{k+1} = \hat{x}_k - \alpha_k\,Q(\hat{x}_k)\,\nabla^c_x l(\hat{x}_k)$$

A vast body of literature in Mathematical Optimization Theory (aka Mathematical Programming) gives conditions on the step size $\alpha_k$ that guarantee convergence of a generalized gradient descent algorithm to a stationary value of $l(x)$ for various choices of $Q(x) = Q(x)^T > 0$. Note that a Generalized Gradient Descent Algorithm turns off once the sequence of estimates has converged to a stationary point of $l(x)$: $\hat{x}_k \to \hat{x}$ with $\nabla^c_x l(\hat{x}) = 0$.
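The discrete iteration above can be sketched in a few lines of code. This sketch is not from the lecture: the function and variable names are illustrative, and a fixed step size $\alpha$ is assumed for simplicity.

```python
import numpy as np

def generalized_gradient_descent(grad, Q, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Iterate x_{k+1} = x_k - alpha * Q(x_k) @ grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                    # Cartesian gradient at the current estimate
        if np.linalg.norm(g) < tol:    # stationary point: the algorithm "turns off"
            break
        x = x - alpha * Q(x) @ g
    return x

# Quadratic example: l(x) = (1/2) x^T A x with A symmetric positive definite,
# so the Cartesian gradient is A x and the unique minimizer is x = 0.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
x_hat = generalized_gradient_descent(
    grad=lambda x: A @ x,
    Q=lambda x: np.eye(2),             # Q = I recovers plain (naive) gradient descent
    x0=np.array([1.0, -1.0]),
)
```

Swapping in a different symmetric positive-definite $Q(x)$ changes only the `Q` argument; the iteration itself is unchanged.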

Generalized Gradient Descent Algorithm (Cont.)

Important special cases:

- $Q(x) = I$: Gradient Descent Algorithm
- $Q(x) = H^{-1}(x)$: Newton Algorithm
- $Q(x) = \Omega_x^{-1}$: Natural Gradient Algorithm

The (Cartesian, or naive) Gradient Descent Algorithm is the simplest to implement, but the slowest to converge. The Newton Algorithm is the most difficult to implement, due to the difficulty of constructing and inverting the Hessian, but the fastest to converge. The Natural (or true) Gradient Descent Algorithm provides improved convergence speed over naive gradient descent.

Derivation of the Newton Algorithm

A Taylor series expansion of $l(x)$ about a current estimate $\hat{x}_k$ of a minimizing point, to second order in $\Delta x = x - \hat{x}_k$, yields

$$l(\hat{x}_k + \Delta x) \approx l(\hat{x}_k) + \frac{\partial l(\hat{x}_k)}{\partial x}\,\Delta x + \frac{1}{2}\,\Delta x^T H(\hat{x}_k)\,\Delta x$$

Minimizing the above with respect to $\Delta x$ results in

$$\Delta x_k = -H^{-1}(\hat{x}_k)\,\nabla^c_x l(\hat{x}_k)$$

Finally, updating the minimizing-point estimate via $\hat{x}_{k+1} = \hat{x}_k + \alpha_k\,\Delta x_k$ yields the Newton Algorithm

$$\hat{x}_{k+1} = \hat{x}_k - \alpha_k\,H^{-1}(\hat{x}_k)\,\nabla^c_x l(\hat{x}_k)$$

As mentioned, the Newton Algorithm generally yields fast convergence. This is particularly true if it can be stabilized using the so-called Newton step size $\alpha_k = 1$.
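As a minimal sketch (not from the lecture; the function names and test problem are illustrative), the Newton iteration with the Newton step size $\alpha_k = 1$:

```python
import numpy as np

def newton(grad, hess, x0, alpha=1.0, n_iter=25):
    """Newton iteration x_{k+1} = x_k - alpha * H^{-1}(x_k) grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # Solve H(x) dx = -grad(x) rather than forming the inverse explicitly.
        dx = np.linalg.solve(hess(x), -grad(x))
        x = x + alpha * dx
    return x

# Example: l(x) = x1^4 + x2^2, with minimizer at the origin.
grad = lambda x: np.array([4.0 * x[0]**3, 2.0 * x[1]])
hess = lambda x: np.array([[12.0 * x[0]**2, 0.0],
                           [0.0,            2.0]])
x_hat = newton(grad, hess, x0=np.array([1.0, 1.0]))
```

On this example the quadratic $x_2$ direction converges in a single Newton step, while the quartic $x_1$ direction (whose Hessian is singular at the minimizer) only contracts by a factor of $2/3$ per step, illustrating that the fast convergence of Newton's method relies on a well-conditioned Hessian near the solution.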

Nonlinear Least-Squares

For the Gaussian model with nonlinear mean $h(x)$ and known noise covariance matrix $C = W^{-1}$, the negative log-likelihood is

$$l(x) = -\log p_x(y) \doteq \frac{1}{2}\,\|y - h(x)\|^2_W$$

The (Cartesian) gradient is given by $\nabla^c_x l(x) = \left(\frac{\partial l(x)}{\partial x}\right)^T$, or

$$\nabla^c_x l(x) = -H^T(x)\,W\,(y - h(x))$$

where $H(x)$ is the Jacobian (linearization) of $h(x)$,

$$H(x) = \frac{\partial h(x)}{\partial x}$$

This yields the Nonlinear Least-Squares Generalized Gradient Descent Algorithm

$$\hat{x}_{k+1} = \hat{x}_k + \alpha_k\,Q(\hat{x}_k)\,H^T(\hat{x}_k)\,W\,(y - h(\hat{x}_k))$$

Nonlinear Least-Squares (Cont.)

Note that the Natural (true) gradient of $l(x)$ is given by

$$\nabla_x l(x) = -\Omega_x^{-1}\,H^T(x)\,W\,(y - h(x)) = -H^*(x)\,(y - h(x))$$

where $H^*(x)$ is the adjoint of $H(x)$. A stationary point $x$, $\nabla_x l(x) = 0$, must satisfy the Nonlinear Normal Equation

$$H^*(x)\,h(x) = H^*(x)\,y$$

and the prediction error $e(x) = y - h(x)$ must lie in $\mathcal{N}(H^*(x)) = \mathcal{R}(H(x))^{\perp}$:

$$H^*(x)\,e(x) = H^*(x)\,(y - h(x)) = 0$$

Provided that the step size $\alpha_k$ is chosen to stabilize the nonlinear least-squares generalized gradient descent algorithm, the sequence of estimates $\hat{x}_k$ will converge to a stationary point of $l(x)$: $\hat{x}_k \to \hat{x}$ with $\nabla_x l(\hat{x}) = 0 = \nabla^c_x l(\hat{x})$.

Nonlinear Least-Squares (Cont.)

IMPLEMENTING THE NEWTON ALGORITHM. One computes the Hessian (written here as $\mathcal{H}(x)$, to distinguish it from the Jacobian $H(x)$ of $h$) from

$$\mathcal{H}(x) = \frac{\partial}{\partial x}\left(\frac{\partial l(x)}{\partial x}\right)^T = \frac{\partial}{\partial x}\,\nabla^c_x l(x)$$

This yields the Hessian for the Weighted Least-Squares Loss Function

$$\mathcal{H}(x) = H^T(x)\,W\,H(x) - \sum_{i=1}^{m} \mathcal{H}_i(x)\,[\,W\,(y - h(x))\,]_i$$

where $\mathcal{H}_i(x) = \frac{\partial}{\partial x}\left(\frac{\partial h_i(x)}{\partial x}\right)^T$ denotes the Hessian of the $i$-th scalar-valued component of the vector function $h(x)$. Note that all terms on the right-hand side of the Hessian expression are symmetric, as required if $\mathcal{H}(x)$ is to be symmetric.
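The Hessian expression can be checked numerically on a toy model. The example below is not from the lecture: the model $h$, the weight matrix, and the data are made up for illustration, and the component Hessians are entered by hand.

```python
import numpy as np

# Toy model h : R^2 -> R^2 with components h_1(x) = x1^2, h_2(x) = x1*x2.
h   = lambda x: np.array([x[0]**2, x[0] * x[1]])
jac = lambda x: np.array([[2.0 * x[0], 0.0],
                          [x[1],       x[0]]])
# Hessians of the scalar components h_1 and h_2 (both constant here).
H1  = lambda x: np.array([[2.0, 0.0], [0.0, 0.0]])
H2  = lambda x: np.array([[0.0, 1.0], [1.0, 0.0]])

def loss_hessian(x, y, W):
    """Hessian of l(x) = (1/2)||y - h(x)||_W^2:
       H^T W H - sum_i Hess(h_i) * [W (y - h(x))]_i."""
    J = jac(x)
    r = W @ (y - h(x))                 # weighted prediction error
    return J.T @ W @ J - (H1(x) * r[0] + H2(x) * r[1])

y  = np.array([1.0, 2.0])
W  = np.diag([1.0, 0.5])
Hl = loss_hessian(np.array([0.5, 1.5]), y, W)   # symmetric, as required
```

Note that the second (residual-weighted) term can dominate when the prediction error is large, which is exactly how the full Hessian can lose positive definiteness.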

Nonlinear Least-Squares (Cont.)

Evidently, the Hessian matrix of the least-squares loss function $l(x)$ can be quite complex. Also note that, because of the second term on the right-hand side of the Hessian expression, the Hessian $\mathcal{H}(x)$ can become singular or indefinite. However, as we have seen, in the special case when $h(x)$ is linear, $h(x) = Hx$, we have $H(x) = H$ and $\mathcal{H}_i(x) = 0$, $i = 1, \dots, m$, yielding

$$\mathcal{H}(x) = H^T W H,$$

which for full column-rank $H$ and positive definite $W$ is always symmetric and invertible. Also note that if we have a good model and a value $\hat{x}$ such that the prediction error $e(\hat{x}) = y - h(\hat{x}) \approx 0$, then

$$\mathcal{H}(\hat{x}) \approx H^T(\hat{x})\,W\,H(\hat{x})$$

Nonlinear Least-Squares: Gauss-Newton Algorithm

Linearizing $h(x)$ about a current estimate $\hat{x}_k$, with $\Delta x = x - \hat{x}_k$, we have

$$h(x) \approx h(\hat{x}_k) + \frac{\partial h(\hat{x}_k)}{\partial x}\,\Delta x = h(\hat{x}_k) + H(\hat{x}_k)\,\Delta x$$

This yields the loss-function approximation

$$l(x) = l(\hat{x}_k + \Delta x) \approx \frac{1}{2}\,\big\|\,(y - h(\hat{x}_k)) - H(\hat{x}_k)\,\Delta x\,\big\|^2_W$$

Assuming that $H(\hat{x}_k)$ has full column rank (which is guaranteed if $h(x)$ is one-to-one in a neighborhood of $\hat{x}_k$), we can minimize $l(\hat{x}_k + \Delta x)$ with respect to $\Delta x$ to obtain

$$\Delta x_k = \left(H^T(\hat{x}_k)\,W\,H(\hat{x}_k)\right)^{-1} H^T(\hat{x}_k)\,W\,(y - h(\hat{x}_k))$$

The update rule $\hat{x}_{k+1} = \hat{x}_k + \alpha_k\,\Delta x_k$ yields a nonlinear least-squares generalized gradient descent algorithm known as the Gauss-Newton Algorithm.
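A minimal Gauss-Newton sketch, assuming the Jacobian has full column rank at every iterate. The exponential model, the data, and all names below are illustrative, not from the lecture.

```python
import numpy as np

def gauss_newton(h, jac, y, W, x0, alpha=1.0, n_iter=50):
    """Gauss-Newton: solve (H^T W H) dx = H^T W (y - h(x)), then x += alpha*dx."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        J = jac(x)
        r = y - h(x)                   # prediction error e(x)
        dx = np.linalg.solve(J.T @ W @ J, J.T @ W @ r)
        x = x + alpha * dx             # Newton step size alpha = 1
    return x

# Zero-residual example: fit y_i = exp(a t_i) + b with true (a, b) = (0.5, 1.0).
t   = np.linspace(0.0, 1.0, 5)
h   = lambda x: np.exp(x[0] * t) + x[1]
jac = lambda x: np.column_stack([t * np.exp(x[0] * t), np.ones_like(t)])
y   = h(np.array([0.5, 1.0]))          # noiseless data, so e -> 0 at the solution
x_hat = gauss_newton(h, jac, y, np.eye(5), x0=np.array([0.0, 0.0]))
```

Because the residual vanishes at the solution, the neglected second-order term of the Hessian is negligible near convergence, and the iteration behaves essentially like Newton's method.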

Gauss-Newton Algorithm (Cont.)

The Gauss-Newton algorithm corresponds to taking the weighting matrix $Q(x) = Q^T(x) > 0$ in the nonlinear least-squares generalized gradient descent algorithm to be (under the assumption that $h(x)$ is locally one-to-one)

$$Q(x) = \left(H^T(x)\,W\,H(x)\right)^{-1} \qquad \text{(Gauss-Newton Algorithm)}$$

Note that when $e(x) = y - h(x) \approx 0$, the Newton and Gauss-Newton algorithms become essentially equivalent. Thus it is not surprising that the Gauss-Newton algorithm can result in very fast convergence, assuming that $e(\hat{x}_k)$ becomes asymptotically small. This is particularly true if the Gauss-Newton algorithm can be stabilized using the Newton step size $\alpha_k = 1$.