
GRADIENT METHODS

GRADIENT = STEEPEST DESCENT

[Figure: a convex function (surface plot) and its iso-contours; the gradient is perpendicular to the contours, and the negative gradient points in the direction of steepest descent.]

GRADIENT DESCENT METHOD

Gradient descent: $x^{k+1} = x^k - \alpha\,\nabla f(x^k)$

[Figure: iso-contours with the gradient and negative gradient directions drawn at an iterate.]
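The update above can be sketched in a few lines (the function names and the quadratic test problem are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, iters=100):
    """Plain gradient descent: x_{k+1} = x_k - alpha * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

# f(x) = 0.5 * ||x||^2 has gradient x; the minimizer is the origin.
x_final = gradient_descent(lambda x: x, x0=[1.0, -2.0], alpha=0.5, iters=50)
```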

STEPSIZE RESTRICTION

$f(x) = \tfrac12 x^2, \qquad \nabla f(x) = x$

$x^{k+1} = x^k - \alpha x^k$

Restriction: $\alpha < 2$

[Figure: iterates of gradient descent on $f(x) = \tfrac12 x^2$ for several stepsizes, converging for $\alpha < 2$.]

RECALL

Lipschitz constant for the gradient: $\|\nabla f(x) - \nabla f(y)\| \le M\|x - y\|$

$f(y) \le f(x) + (y - x)^T\nabla f(x) + \tfrac{M}{2}\|y - x\|^2$

When the Hessian exists: $M \ge \|\nabla^2 f(x)\|$

STABILITY RESTRICTION

If you know the Lipschitz constant for the gradient:

$f(y) \le f(x) + (y - x)^T\nabla f(x) + \tfrac{M}{2}\|y - x\|^2$

Gradient step:

$f(x - \alpha\nabla f(x)) \le f(x) - \alpha\nabla f(x)^T\nabla f(x) + \tfrac{M\alpha^2}{2}\|\nabla f(x)\|^2$
$\qquad = f(x) - \alpha\|\nabla f(x)\|^2 + \tfrac{M\alpha^2}{2}\|\nabla f(x)\|^2$
$\qquad = f(x) + \left(\tfrac{M\alpha^2}{2} - \alpha\right)\|\nabla f(x)\|^2$

$\tfrac{M\alpha^2}{2} - \alpha < 0 \;\Rightarrow\; \alpha < \tfrac{2}{M}$

This proves convergence in terms of the gradient.
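The $\alpha < 2/M$ restriction is easy to check numerically. A tiny illustrative script (the function and names are my own, not from the slides): on $f(x) = \tfrac12 x^2$, where $M = 1$, the iteration contracts for $\alpha < 2$ and blows up for $\alpha > 2$:

```python
def run_gd(alpha, x0=1.0, iters=100):
    # f(x) = 0.5 * x**2 has gradient x and gradient-Lipschitz constant M = 1,
    # so the stability restriction is alpha < 2/M = 2.
    x = x0
    for _ in range(iters):
        x = x - alpha * x          # x_{k+1} = (1 - alpha) * x_k
    return x

stable = run_gd(alpha=1.5)       # |1 - 1.5| = 0.5 < 1: converges to 0
unstable = run_gd(alpha=2.5)     # |1 - 2.5| = 1.5 > 1: diverges
```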

LINE SEARCH METHODS

Choose search direction: $d$
SEARCH for a stepsize $\alpha$ that satisfies SOME inequality
Update iterate: $x^{k+1} = x^k + \alpha d$

Armijo condition: $f(x^k + \alpha d) \le f(x^k) + c_1\,\alpha\, d^T\nabla f(x^k)$, with $0 < c_1 < 1$

[Figure: $f$ along the search direction, with the lines $f(x^k) + \alpha\, d^T\nabla f(x^k)$ and $f(x^k) + c_1\alpha\, d^T\nabla f(x^k)$ marking acceptable stepsizes.]

WOLFE CONDITIONS

Choose search direction: $d = -\nabla f(x^k)$
Find a stepsize $\alpha$ satisfying the Wolfe conditions:

Armijo condition (don't go too far): $f(x^k + \alpha d) \le f(x^k) + c_1\,\alpha\, d^T\nabla f(x^k)$, $0 < c_1 < 1$
Curvature condition (don't go too near): $d^T\nabla f(x^k + \alpha d) \ge c_2\, d^T\nabla f(x^k)$, $c_1 < c_2 < 1$

Update iterate: $x^{k+1} = x^k + \alpha d$

Theorem: Suppose we have the uniform bound
$-\dfrac{\nabla f(x^k)^T d}{\|\nabla f(x^k)\|\,\|d\|} > c$
for a convex function; then $\lim_{k\to\infty} \nabla f(x^k) = 0$.

LINE SEARCH IN REAL LIFE

Backtracking/Armijo line search:
Choose search direction: $d = -\nabla f(x^k)$
While $f(x^k + \alpha d) > f(x^k) + c_1\,\alpha\, d^T\nabla f(x^k)$: set $\alpha \leftarrow \alpha/2$
Update iterate: $x^{k+1} = x^k + \alpha d$

Exact line search:
Choose search direction: $d = -\nabla f(x^k)$
$\alpha = \arg\min_\alpha f(x^k + \alpha d)$
Update iterate: $x^{k+1} = x^k + \alpha d$
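The backtracking variant can be sketched as follows (a minimal sketch; the constant $c = 10^{-4}$ and the badly scaled quadratic are illustrative choices, not from the slides):

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, c=1e-4, iters=200):
    """Gradient descent with backtracking/Armijo line search (stepsize halving)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        d = -g
        alpha = 1.0
        # Armijo: f(x + alpha*d) <= f(x) + c * alpha * d^T grad_f(x)
        while f(x + alpha * d) > f(x) + c * alpha * np.dot(d, g):
            alpha *= 0.5
        x = x + alpha * d
    return x

# Badly scaled quadratic: f(x) = 0.5 * (x1^2 + 10 * x2^2)
f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)
g = lambda x: np.array([x[0], 10.0 * x[1]])
x_final = backtracking_gd(f, g, [1.0, 1.0])
```

Note that no Lipschitz constant is needed: the while loop shrinks $\alpha$ until sufficient decrease holds, which is why this is the line search "in real life".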

ADAPTIVE STEPSIZE METHOD

Consider this simple model problem: $f(x) = \tfrac{a}{2}x^Tx$

$x^{k+1} = x^k - \alpha\nabla f(x^k) = x^k - \alpha\,(a\,x^k)$

Best choice: $\alpha = 1/a$

Barzilai-Borwein: use the model $f(x) \approx \tfrac{a}{2}x^Tx$ and figure out the best approximation to $a$.

BB METHOD

Barzilai-Borwein model: $f(x) \approx \tfrac{a}{2}x^Tx \;\Rightarrow\; \nabla f(y) - \nabla f(x) \approx a\,(y - x)$

On each iteration, define:
$\Delta g = \nabla f(x^{k+1}) - \nabla f(x^k), \qquad \Delta x = x^{k+1} - x^k$

The model tells us $\Delta g \approx a\,\Delta x$, so solve
$\min_a \tfrac12\|a\,\Delta x - \Delta g\|_2^2$

BB METHOD

$\min_a \tfrac12\|a\,\Delta x - \Delta g\|_2^2$

Setting the derivative to zero: $a\,\Delta x^T\Delta x - \Delta x^T\Delta g = 0$

$a = \dfrac{\Delta x^T\Delta g}{\Delta x^T\Delta x}, \qquad \alpha = \dfrac{1}{a} = \dfrac{\Delta x^T\Delta x}{\Delta x^T\Delta g}$

BB METHOD

Choose search direction: $d = -\nabla f(x^k)$
Compute BB step: $\alpha_k = \dfrac{\Delta x^T\Delta x}{\Delta x^T\Delta g}$
While $f(x^k + \alpha_k d) > f(x^k) + c_1\,\alpha_k\, d^T\nabla f(x^k)$: set $\alpha_k \leftarrow \alpha_k/2$
Update iterate: $x^{k+1} = x^k + \alpha_k d$

For some smooth problems, the BB method is superlinear, i.e. for large enough $k$
$f(x^{k+1}) - f^\star \le C\,(f(x^k) - f^\star)^p$
for some $C > 0$ and $p > 1$. This rate is asymptotic, and not global.
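A bare sketch of the BB stepsize, without the line-search safeguard (the quadratic test problem and the initial stepsize $10^{-3}$ are illustrative assumptions):

```python
import numpy as np

def bb_gradient(grad_f, x0, alpha0=1e-3, iters=100):
    """Gradient descent with the Barzilai-Borwein stepsize
    alpha_k = (dx^T dx) / (dx^T dg), no safeguard."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    alpha = alpha0
    for _ in range(iters):
        x_new = x - alpha * g
        g_new = grad_f(x_new)
        dx, dg = x_new - x, g_new - g
        if abs(np.dot(dx, dg)) > 1e-16:   # guard against division by ~0 at convergence
            alpha = np.dot(dx, dx) / np.dot(dx, dg)
        x, g = x_new, g_new
    return x

# Quadratic f(x) = 0.5 * x^T D x with D = diag(1, 10): gradient is D * x
D = np.array([1.0, 10.0])
x_final = bb_gradient(lambda x: D * x, [1.0, 1.0])
```

On quadratics the BB stepsize equals the inverse of a Rayleigh quotient of the Hessian, which is why it adapts between the curvatures 1 and 10 here; progress is typically nonmonotone, which is why practical versions keep a (nonmonotone) line search.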

CONVERGENCE RATE: EXACT LINE SEARCH

Strong convexity: $f(y) \ge f(x) + (y - x)^T\nabla f(x) + \tfrac{m}{2}\|y - x\|^2$
Lipschitz bound: $f(y) \le f(x) + (y - x)^T\nabla f(x) + \tfrac{M}{2}\|y - x\|^2$

Substitute $y = x^{k+1}$, $x = x^k$, so $y - x = x^{k+1} - x^k = -\alpha\nabla f(x^k)$:

$f(x^{k+1}) \le \min_\alpha\; f(x^k) - \alpha\,\nabla f(x^k)^T\nabla f(x^k) + \tfrac{M\alpha^2}{2}\|\nabla f(x^k)\|^2$

CONVERGENCE RATE: EXACT LINE SEARCH

$f(x^{k+1}) \le \min_\alpha\; f(x^k) - \alpha\|\nabla f(x^k)\|^2 + \tfrac{M\alpha^2}{2}\|\nabla f(x^k)\|^2$

The minimum is at $\alpha = 1/M$:

$f(x^{k+1}) \le f(x^k) - \tfrac{1}{2M}\|\nabla f(x^k)\|^2$

Subtract $f^\star$ to write this in terms of the optimality gap, so we get relative change:

$f(x^{k+1}) - f^\star \le f(x^k) - f^\star - \tfrac{1}{2M}\|\nabla f(x^k)\|^2$

STRONG CONVEXITY BOUND

$f(x^{k+1}) - f^\star \le f(x^k) - f^\star - \tfrac{1}{2M}\|\nabla f(x^k)\|^2$

Strong convexity: $f(y) \ge f(x) + (y - x)^T\nabla f(x) + \tfrac{m}{2}\|y - x\|^2$

Minimize the right-hand side over $y$. Optimality:
$\min_y\; f(x^k) + (y - x^k)^T\nabla f(x^k) + \tfrac{m}{2}\|y - x^k\|^2 \;\Rightarrow\; y - x^k = -\tfrac{1}{m}\nabla f(x^k)$

$f(x^\star) \ge f(x^k) - \tfrac{1}{m}\|\nabla f(x^k)\|^2 + \tfrac{1}{2m}\|\nabla f(x^k)\|^2$

$\Rightarrow\; 2m\,(f(x^k) - f(x^\star)) \le \|\nabla f(x^k)\|^2$

STRONG CONVEXITY BOUND

$f(x^{k+1}) - f^\star \le f(x^k) - f^\star - \tfrac{1}{2M}\|\nabla f(x^k)\|^2$

Using $2m\,(f(x^k) - f^\star) \le \|\nabla f(x^k)\|^2$:

$f(x^{k+1}) - f^\star \le f(x^k) - f^\star - \tfrac{m}{M}\,(f(x^k) - f^\star) = \left(1 - \tfrac{m}{M}\right)(f(x^k) - f^\star)$

What's this factor?

STRONG CONVEXITY BOUND

$f(x^{k+1}) - f^\star \le \left(1 - \tfrac{m}{M}\right)(f(x^k) - f^\star)$

By induction:
$f(x^k) - f^\star \le \left(1 - \tfrac{m}{M}\right)^k (f(x^0) - f^\star)$

Theorem: Gradient descent with exact line search satisfies linear convergence when the objective is strongly convex:
$f(x^k) - f^\star \le \left(1 - \tfrac{1}{\kappa}\right)^k (f(x^0) - f^\star)$
where $\kappa = M/m$ is the condition number.

GRADIENT DESCENT USES CONTOURS

Gradients are perpendicular to contours.

[Figure: iso-contours of a quadratic with condition number $\kappa = 2$; gradient descent steps cross the contours at right angles.]

POOR CONDITIONING: CONTOUR INTERPRETATION

$\kappa = 100$: contours are (almost) parallel!

[Figure: iso-contours of a quadratic with $\kappa = 100$, with a zoomed view showing nearly parallel contour lines.]

GRADIENT DESCENT: WEAKLY CONVEX PROBLEMS

All results so far assume strong convexity.

For strongly convex problems: $f(x^k) - f^\star \le \left(1 - \tfrac{m}{M}\right)^k (f(x^0) - f^\star)$

For weakly convex problems: $f(x^{k+1}) - f^\star \le \dfrac{M\|x^0 - x^\star\|^2}{2(k+1)}$ — slow (worst case).

WHAT'S THE BEST WE CAN DO?

Goal: put a worst-case bound on the convergence of first-order methods.

Definition: a method is first-order if
$x^k \in \mathrm{span}\{x^0, \nabla f(x^0), \nabla f(x^1), \ldots, \nabla f(x^{k-1})\}$

Theorem: For any first-order method, there is a smooth function with Lipschitz constant $M$ such that
$f(x^k) - f(x^\star) \ge \dfrac{3M\|x^0 - x^\star\|^2}{32(k+1)^2}$

(Nemirovsky and Yudin, '82)

PROOF

The worst function in the world:
$f_n(x) = \dfrac{L}{4}\left(\tfrac12 x_1^2 + \tfrac12\sum_{i=1}^{n-1}(x_i - x_{i+1})^2 + \tfrac12 x_n^2 - x_1\right)$

Minimizer: $x_i^\star = 1 - \dfrac{i}{n+1}$, with $f_n(x^\star) = \dfrac{L}{8}\left(\dfrac{1}{n+1} - 1\right)$

PROOF

The worst function in the world:
$f_n(x) = \dfrac{L}{4}\left(\tfrac12 x_1^2 + \tfrac12\sum_{i=1}^{n-1}(x_i - x_{i+1})^2 + \tfrac12 x_n^2 - x_1\right), \qquad f_n(x^\star) = \dfrac{L}{8}\left(\dfrac{1}{n+1} - 1\right)$

After iteration $k$ (starting from $x^0 = 0$), a first-order method has touched only the first $k$ coordinates: $x_i = 0$ for $i > k$.

Consider the function with $n = 2k$ variables:
$f_{2k}(x^k) = f_k(x^k) \ge \dfrac{L}{8}\left(\dfrac{1}{k+1} - 1\right), \qquad f_{2k}(x^\star) = \dfrac{L}{8}\left(\dfrac{1}{2k+2} - 1\right)$

PROOF

Consider the function with $n = 2k$ variables: the lowest possible energy on step $k$ vs. the true solution, starting at $x^0 = 0$, gives

$\dfrac{f_{2k}(x^k) - f_{2k}(x^\star)}{\|x^0 - x^\star\|^2} \ge \dfrac{\frac{L}{8}\left(\frac{1}{k+1} - \frac{1}{2k+2}\right)}{\frac{1}{3}(2k+2)} = \dfrac{3L}{32(k+1)^2}$

NESTEROV'S METHOD ACHIEVES THE BOUND (Nemirovski and Yudin '83)

[Figure: gradient descent iterates $x^0, x^1, x^2, \ldots$ zig-zag toward the optimum, while Nesterov iterates take a smoother accelerated path.]

Gradient: $f(x^{k+1}) - f^\star \le \dfrac{L\|x^0 - x^\star\|^2}{2(k+1)}$

Nesterov (optimal): $f(x^{k+1}) - f^\star \le \dfrac{2L\|x^0 - x^\star\|^2}{(k+1)^2}$

NESTEROV'S METHOD

Stepsize $\alpha < 1/L$, momentum parameter $\lambda_0 = 1$.

Prediction step: $x^{k+1} = y^k - \alpha\nabla f(y^k)$

Momentum parameter: $\lambda_{k+1} = \dfrac{1 + \sqrt{1 + 4\lambda_k^2}}{2}$

Momentum term: $y^{k+1} = x^{k+1} + \dfrac{\lambda_k - 1}{\lambda_{k+1}}\,(x^{k+1} - x^k)$
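A minimal sketch of the accelerated iteration, assuming the FISTA-style $\lambda$-sequence with $\lambda_0 = 1$ and step $1/L$ (the test problem is an illustrative ill-conditioned quadratic, not from the slides):

```python
import numpy as np

def nesterov(grad_f, x0, L, iters=500):
    """Nesterov's accelerated gradient method with lambda-momentum."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    lam = 1.0
    for _ in range(iters):
        x_new = y - (1.0 / L) * grad_f(y)                    # prediction step
        lam_new = (1.0 + np.sqrt(1.0 + 4.0 * lam**2)) / 2.0  # momentum parameter
        y = x_new + ((lam - 1.0) / lam_new) * (x_new - x)    # momentum term
        x, lam = x_new, lam_new
    return x

# Ill-conditioned quadratic f(x) = 0.5 * sum(d_i * x_i^2) with L = max(d_i) = 100
D = np.array([1.0, 100.0])
x_final = nesterov(lambda x: D * x, [1.0, 1.0], L=100.0)
f_final = 0.5 * np.sum(D * x_final**2)
```

The $2L\|x^0 - x^\star\|^2/(k+1)^2$ guarantee then bounds the final gap by roughly $400/501^2 \approx 1.6\times10^{-3}$ here; in practice the iterates do better than the worst case.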

OPTIMAL CONVERGENCE

Theorem: The objective error of the $k$th iterate of Nesterov's method is bounded by
$f(x^k) - f(x^\star) \le \dfrac{2L\|x^0 - x^\star\|^2}{(k+1)^2}$

This worst case doesn't mean much in practice; adaptive convergence is faster.

POOR CONDITIONING: FUNCTION INTERPRETATION

$f(x) = \sum_i \dfrac{d_i}{2}x_i^2 = \tfrac12 x^T D x$

[Figure: one-dimensional slices $\tfrac{d_i}{2}x_i^2$ for $d_i = 3$ (shallow) and $d_i = 50$ (steep).]

POOR CONDITIONING: FUNCTION INTERPRETATION

$f(x) = \sum_i \dfrac{d_i}{2}x_i^2 = \tfrac12 x^T D x, \qquad x_i^{k+1} = x_i^k - \alpha\, d_i\, x_i^k$

One stepsize to rule them all: the same $\alpha = 1/50$ is a big step on the steep slice ($d_i = 50$) but a small step on the shallow one ($d_i = 3$).

[Figure: the two slices with the single stepsize $\alpha = 1/50$ marked as a big step on one and a small step on the other.]

THIS IS ALWAYS HAPPENING!

minimize $\tfrac12 x^T H x + g^T x$

EVD $H = U D U^T$: minimize $\tfrac12 x^T U D U^T x + g^T x$

Change variables $y = U^T x$:

minimize $\tfrac12 y^T D y + (U^T g)^T y = \sum_i \dfrac{d_i}{2} y_i^2 + (U^T g)_i\, y_i$

PRECONDITIONERS

Change variables $x = Py$:

minimize $f(x) \;\longrightarrow\;$ minimize $f(Py)$

Hessians: $H \;\longrightarrow\; P^T H P$

THE BEST PRECONDITIONER

With $x = Py$ and $P = H^{-1/2}$:

minimize $f(x) \;\longrightarrow\;$ minimize $f(Py)$

Hessians: $H \;\longrightarrow\; P^T H P = H^{-1/2} H H^{-1/2} = I$
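The identity $P^T H P = I$ for $P = H^{-1/2}$ is easy to verify numerically (a small sketch; the random SPD matrix is an illustrative choice):

```python
import numpy as np

# Build a random SPD "Hessian" and check that P = H^{-1/2} gives P^T H P = I.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A @ A.T + 3.0 * np.eye(3)        # SPD by construction

# H^{-1/2} via the eigendecomposition H = U D U^T
D, U = np.linalg.eigh(H)
P = U @ np.diag(D**-0.5) @ U.T       # symmetric inverse square root

I_approx = P.T @ H @ P
```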

SVD INTERPRETATION

$f(x) = \tfrac12 x^T H x = \tfrac12 x^T U D U^T x = \tfrac12 x^T U D^{1/2} D^{1/2} U^T x = \tfrac12 y^T y$

where $y = D^{1/2} U^T x$, i.e. $x = U D^{-1/2} y = P y$: a rotation followed by a stretch.

[Figure: the preconditioner $P$ maps the circular contours of $\tfrac12 y^T y$ to the elliptical contours of the original problem via rotation and stretch.]

NEWTON'S METHOD

Change variables $x = H^{-1/2} y$: minimize $f(x)$ becomes minimize $f(H^{-1/2} y)$.

Gradient step: $y^{k+1} = y^k - H^{-1/2}\nabla f(H^{-1/2} y^k)$

Change variables back to $x$:
$H^{-1/2} y^{k+1} = H^{-1/2} y^k - H^{-1/2} H^{-1/2}\nabla f(H^{-1/2} y^k)$
$x^{k+1} = x^k - H^{-1}\nabla f(x^k)$ (the Newton direction)

There are many interpretations.

GEOMETRIC INTERPRETATION

minimize $f(x) \approx f(x^k) + (x - x^k)^T g + \tfrac12 (x - x^k)^T H (x - x^k)$

Derivative: $g + H(x - x^k) = 0 \;\Rightarrow\; x^{k+1} = x^k - H^{-1} g$

[Figure: $f$ and its quadratic model at $x^k$; the minimizer of the model is $x^{k+1}$.]

CLASSICAL NEWTON METHOD

Newton's method = algorithm for root finding: solve $g(x) = 0$

$x^{k+1} = x^k - \dfrac{g(x^k)}{g'(x^k)}$

For a nonlinear system of equations, linearize: $g(x) \approx g(x^k) + \nabla g(x^k)(x - x^k)$

$x^{k+1} = x^k - \nabla g(x^k)^{-1} g(x^k)$

[Figure: the tangent-line construction taking $x^k$ to $x^{k+1}$.]

NEWTON METHOD FOR OPTIMIZATION

We want $\nabla f(x) = 0$. Taylor's theorem:
$f(x) \approx f(x^k) + (x - x^k)^T\nabla f(x^k) + \tfrac12 (x - x^k)^T\nabla^2 f(x^k)(x - x^k)$

Linear approximation of the gradient:
$\nabla f(x) \approx \nabla f(x^k) + \nabla^2 f(x^k)(x - x^k) = g + H(x - x^k) = 0$

$x^{k+1} = x^k - H^{-1} g$
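The pure Newton iteration can be sketched as follows (a minimal sketch; the smooth test function $f(x) = \log\cosh(x_1) + x_2^2$ is an illustrative choice, not from the slides):

```python
import numpy as np

def newton(grad_f, hess_f, x0, iters=10):
    """Pure Newton: x_{k+1} = x_k - H^{-1} g (full steps, no damping)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        H = hess_f(x)
        x = x - np.linalg.solve(H, g)   # solve H d = g instead of forming H^{-1}
    return x

# f(x) = log(cosh(x1)) + x2^2: gradient (tanh(x1), 2*x2), Hessian diag(sech^2(x1), 2)
grad = lambda x: np.array([np.tanh(x[0]), 2.0 * x[1]])
hess = lambda x: np.array([[1.0 - np.tanh(x[0])**2, 0.0], [0.0, 2.0]])
x_final = newton(grad, hess, [1.0, 0.5])
```

From this starting point the iterates converge extremely fast; from a starting point much farther away the full Newton step on this function overshoots badly, which is what motivates the damped version below in the slides.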

SIMPLE RATE ANALYSIS

Bound the error $e^k = x^k - x^\star$. Assume $f \in C^2$, $e^k$ is sufficiently small, and strong convexity.

$0 = \nabla f(x^\star) = \nabla f(x^k - e^k) = \nabla f(x^k) - H e^k + O(\|e^k\|^2)$

Multiply by the inverse Hessian:
$0 = H^{-1}\nabla f(x^k) - e^k + O(\|e^k\|^2)$

Note $e^{k+1} = e^k - H^{-1}\nabla f(x^k)$, so $e^{k+1} = O(\|e^k\|^2)$.

(Fletcher, Practical Methods of Optimization)

DAMPED NEWTON METHOD

Choose search direction: $d = -\nabla^2 f(x^k)^{-1}\nabla f(x^k)$
Find a stepsize $\alpha$ satisfying the Wolfe conditions:
$f(x^k + \alpha d) \le f(x^k) + c_1\,\alpha\, d^T\nabla f(x^k)$, $0 < c_1 < 1$
(here $d^T\nabla f(x^k)$ is the predicted objective change in the Newton direction)
Update iterate: $x^{k+1} = x^k + \alpha d$

$\alpha < 1$: damped step. $\alpha = 1$: full Newton step.

DAMPED NEWTON METHOD

Theorem: Suppose we have a Lipschitz constant for the Hessian,
$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L_H\|x - y\|$.
When the gradient gets small enough to satisfy
$\|\nabla f(x^k)\| \le 3(1 - 2c_1)\dfrac{m^2}{L_H}$,
the unit stepsize is an Armijo step, and (with $\alpha = 1$)
$\|\nabla f(x^{k+1})\| \le \dfrac{L_H}{2m^2}\|\nabla f(x^k)\|^2$

(see Boyd and Vandenberghe)
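The damped iteration can be sketched with Armijo backtracking on the Newton direction (a minimal sketch; the one-dimensional test function $f(x) = \log\cosh(x)$ and the constant $c = 10^{-4}$ are illustrative assumptions). Starting from $x^0 = 3$, the full Newton step on this function overshoots to roughly $-98$, so the damping is what saves the iteration:

```python
import numpy as np

def damped_newton(f, grad_f, hess_f, x0, c=1e-4, iters=50):
    """Newton direction with Armijo backtracking; always try alpha = 1 first,
    so the full step (and its quadratic rate) kicks in near the solution."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        d = -np.linalg.solve(hess_f(x), g)
        alpha = 1.0
        while f(x + alpha * d) > f(x) + c * alpha * np.dot(g, d):
            alpha *= 0.5                  # damp the step until Armijo holds
        x = x + alpha * d
    return x

# f(x) = log(cosh(x)): f' = tanh(x), f'' = sech^2(x)
f = lambda x: np.log(np.cosh(x[0]))
grad = lambda x: np.array([np.tanh(x[0])])
hess = lambda x: np.array([[1.0 - np.tanh(x[0])**2]])
x_final = damped_newton(f, grad, hess, [3.0])
```

Consistent with the theorem above, once the gradient is small the Armijo test accepts $\alpha = 1$ and the convergence becomes quadratic.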

NEWTON IN THE WILD

In practice the Hessian might be indefinite. If you start at a point of negative curvature, the Newton step goes up!

The (negative) search direction must form an acute angle with the gradient: $g^T d > 0$, with update $x^{k+1} = x^k - \alpha d$.

We can use any SPD matrix to approximate the Hessian:
$d = \hat H^{-1} g, \qquad g^T d = g^T \hat H^{-1} g > 0$

MODIFIED HESSIAN

We can use any SPD matrix to approximate the Hessian.

Levenberg-Marquardt: $\hat H = H + \lambda I$ — add a large enough multiple of the identity to the Hessian that it becomes SPD.

Modified Cholesky: begin with a partial Cholesky factorization
$H \;\to\; \begin{pmatrix} L & 0 \\ \ast & I \end{pmatrix}\begin{pmatrix} I & 0 \\ 0 & B \end{pmatrix}\begin{pmatrix} L & 0 \\ \ast & I \end{pmatrix}^T$
and replace the trailing block $B$ with an SPD matrix.

(See Fang & O'Leary, '06)
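The Levenberg-Marquardt fix can be sketched by increasing $\lambda$ until a Cholesky factorization succeeds (a minimal sketch; the doubling schedule and starting value $10^{-3}$ are illustrative choices, not from the slides):

```python
import numpy as np

def lm_direction(H, g, tau0=1e-3):
    """Increase tau until H + tau*I is SPD (Cholesky succeeds), then solve
    (H + tau*I) d = g. With update x_{k+1} = x_k - alpha*d, the guarantee
    g^T d = g^T (H + tau*I)^{-1} g > 0 makes -d a descent direction."""
    n = H.shape[0]
    tau = 0.0
    while True:
        try:
            np.linalg.cholesky(H + tau * np.eye(n))   # raises if not SPD
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, tau0)                # double tau and retry
    d = np.linalg.solve(H + tau * np.eye(n), g)
    return d, tau

# Indefinite "Hessian" at a saddle point: eigenvalues 1 and -2
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d, tau = lm_direction(H, g)
```

Here $\tau$ must exceed 2 (the magnitude of the most negative eigenvalue) before the shifted matrix becomes SPD; practical modified-Cholesky schemes achieve the same effect while factorizing only once.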