Shiqian Ma, MAT-258A: Numerical Optimization. Chapter 3: Gradient Method



3.1. Gradient method

Classical gradient method: to minimize a differentiable convex function $f$,
$$\min_{x \in \mathbb{R}^n} f(x).$$
The algorithm: choose $x^0$ and repeat
$$x^{k+1} = x^k - t_k \nabla f(x^k), \quad k = 0, 1, 2, \ldots$$
Note that $p^k := -\nabla f(x^k)$ is a descent direction. (We call $p^k$ a descent direction if $(p^k)^\top \nabla f(x^k) < 0$.)

Question: how to choose the step size $t_k$?
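The update above can be sketched in a few lines. This is an illustrative implementation, not from the notes; the function name `gradient_method` and the example quadratic are assumptions.

```python
import numpy as np

def gradient_method(grad, x0, t=0.1, iters=100):
    """Fixed-step gradient method: x^{k+1} = x^k - t * grad(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# example: minimize f(x) = x1^2 + 2*x2^2, whose gradient is (2*x1, 4*x2)
# and whose minimizer is the origin
x_star = gradient_method(lambda x: np.array([2*x[0], 4*x[1]]),
                         [3.0, -2.0], t=0.1, iters=200)
```

With a suitable fixed step, each coordinate contracts geometrically toward the minimizer; the choice of $t_k$ is exactly the question the next slide addresses.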

Step size rules:
- exact line search: $t_k = \operatorname{argmin}_t f(x^k - t\nabla f(x^k))$
- fixed: $t_k$ constant
- backtracking line search (most practical)

Backtracking line search: initialize $t_k$ at some $\hat{t} > 0$ (for example, $\hat{t} = 1$), and repeat $t_k := \beta t_k$ until
$$f(x - t_k \nabla f(x)) < f(x) - \alpha t_k \|\nabla f(x)\|_2^2.$$
This is called the Armijo condition. Two parameters: $0 < \beta < 1$ and $0 < \alpha \le 0.5$.
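A minimal sketch of the backtracking rule above; the helper name `backtracking` and the toy objective are illustrative assumptions, not part of the notes.

```python
import numpy as np

def backtracking(f, grad_x, x, t_hat=1.0, alpha=0.25, beta=0.5):
    """Shrink t by beta until the Armijo condition
    f(x - t*grad_x) < f(x) - alpha * t * ||grad_x||^2 holds."""
    t = t_hat
    g2 = np.dot(grad_x, grad_x)
    while f(x - t * grad_x) >= f(x) - alpha * t * g2:
        t *= beta
    return t

# toy example: f(x) = 5*||x||^2, so the gradient at x is 10*x
f = lambda x: 5.0 * np.dot(x, x)
x = np.array([1.0, 1.0])
t = backtracking(f, 10 * x, x)
```

Starting from $\hat{t}=1$ and halving, the loop stops at the first $t$ satisfying the Armijo condition; here that is $t = 0.125$.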

General line search: at $x^k$, compute a descent direction $p^k$; what step size $t_k$ should we take in the move
$$x^{k+1} = x^k + t_k p^k?$$
- exact line search: $\min_t f(x^k + t p^k)$
- Armijo condition: $f(x^k + t p^k) \le f(x^k) + c_1 t \nabla f(x^k)^\top p^k$
- Wolfe conditions:
$$f(x^k + t p^k) \le f(x^k) + c_1 t \nabla f(x^k)^\top p^k$$
$$\nabla f(x^k + t p^k)^\top p^k \ge c_2 \nabla f(x^k)^\top p^k$$
with $0 < c_1 < c_2 < 1$.
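The two Wolfe conditions are easy to check numerically. The checker below is a sketch under assumed names (`wolfe_ok`, the toy quadratic); it is not part of the notes.

```python
import numpy as np

def wolfe_ok(f, grad, x, p, t, c1=1e-4, c2=0.9):
    """Check both Wolfe conditions at step size t along direction p."""
    gxp = np.dot(grad(x), p)                      # directional derivative at x
    armijo = f(x + t*p) <= f(x) + c1 * t * gxp    # sufficient decrease
    curvature = np.dot(grad(x + t*p), p) >= c2 * gxp
    return bool(armijo and curvature)

# toy check on f(x) = ||x||^2 with the steepest-descent direction
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2 * x
x = np.array([1.0, 0.0])
p = -grad(x)
ok_half = wolfe_ok(f, grad, x, p, 0.5)    # exact minimizing step along p
ok_tiny = wolfe_ok(f, grad, x, p, 0.001)  # too small: slope barely changes
```

The curvature condition is what rules out tiny steps: at $t=0.001$ the directional derivative is still close to its value at $x$, so the second inequality fails even though the Armijo condition holds.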

3.1.1. Analysis of the gradient method

$$x^{k+1} = x^k - t_k \nabla f(x^k), \quad k = 0, 1, 2, \ldots$$
with fixed step size or backtracking line search.

Assumptions:
- $f$ is convex and differentiable with $\operatorname{dom} f = \mathbb{R}^n$
- $\nabla f(x)$ is Lipschitz continuous with parameter $L > 0$
- the optimal value $f^\star = \inf_x f(x)$ is finite and attained at $x^\star$

Lipschitz continuity: a function $h$ is called Lipschitz continuous with Lipschitz constant $L$ if
$$\|h(x) - h(y)\| \le L\|x - y\|, \quad \forall x, y \in \operatorname{dom} h.$$
If $\nabla f(x)$ is Lipschitz continuous with parameter $L > 0$, then (quadratic upper bound)
$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|_2^2, \quad \forall x, y \in \operatorname{dom} f.$$
If $\operatorname{dom} f = \mathbb{R}^n$ and $f$ has a minimizer $x^\star$, then
$$\frac{1}{2L}\|\nabla f(x)\|_2^2 \le f(x) - f(x^\star) \le \frac{L}{2}\|x - x^\star\|_2^2.$$
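A quick numerical sanity check of the quadratic upper bound, not from the notes: for $f(x) = \frac{1}{2}x^\top A x$ with symmetric positive definite $A$, the gradient is Lipschitz with $L = \lambda_{\max}(A)$, and the bound should hold at every pair of points. The matrix and points below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
L = np.linalg.eigvalsh(A).max()          # Lipschitz constant of the gradient
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# check f(y) <= f(x) + grad(x)^T (y - x) + (L/2)||y - x||^2 at random pairs
holds = all(
    f(y) <= f(x) + grad(x) @ (y - x) + 0.5 * L * np.dot(y - x, y - x) + 1e-12
    for x, y in (rng.standard_normal((2, 2)) for _ in range(100))
)
```

For a quadratic the inequality is exactly the statement $A \preceq L I$, so it holds with equality along the top eigenvector and strictly elsewhere.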

Strong convexity: $f$ is strongly convex with parameter $\mu > 0$ if $f(x) - \frac{\mu}{2}\|x\|_2^2$ is convex.

First-order condition:
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|_2^2, \quad \forall x, y \in \operatorname{dom} f.$$
Second-order condition:
$$\nabla^2 f(x) \succeq \mu I, \quad \forall x \in \operatorname{dom} f.$$
If $\operatorname{dom} f = \mathbb{R}^n$, then $f$ has a minimizer $x^\star$, and
$$\frac{\mu}{2}\|x - x^\star\|_2^2 \le f(x) - f(x^\star) \le \frac{1}{2\mu}\|\nabla f(x)\|_2^2.$$

3.1.2. Analysis for constant step size

Recall the quadratic upper bound:
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|_2^2.$$
Plug in $y = x - t\nabla f(x)$ to obtain
$$f(x - t\nabla f(x)) \le f(x) - t\left(1 - \frac{Lt}{2}\right)\|\nabla f(x)\|_2^2.$$
Let $x^+ = x - t\nabla f(x)$ and assume $0 < t \le 1/L$; then
$$f(x^+) \le f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2 \le f^\star + \nabla f(x)^\top (x - x^\star) - \frac{t}{2}\|\nabla f(x)\|_2^2$$
(the second inequality uses convexity, $f(x) \le f^\star + \nabla f(x)^\top (x - x^\star)$), and completing the square gives
$$f(x^+) \le f^\star + \frac{1}{2t}\left(\|x - x^\star\|_2^2 - \|x - x^\star - t\nabla f(x)\|_2^2\right) = f^\star + \frac{1}{2t}\left(\|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2\right).$$

Take $x = x^{i-1}$, $x^+ = x^i$, $t_i = t$, and sum the bounds for $i = 1, \ldots, k$:
$$\sum_{i=1}^{k}\left(f(x^i) - f^\star\right) \le \frac{1}{2t}\sum_{i=1}^{k}\left(\|x^{i-1} - x^\star\|_2^2 - \|x^i - x^\star\|_2^2\right) = \frac{1}{2t}\left(\|x^0 - x^\star\|_2^2 - \|x^k - x^\star\|_2^2\right) \le \frac{1}{2t}\|x^0 - x^\star\|_2^2.$$
Since $f(x^i)$ is non-increasing,
$$f(x^k) - f^\star \le \frac{1}{k}\sum_{i=1}^{k}\left(f(x^i) - f^\star\right) \le \frac{1}{2kt}\|x^0 - x^\star\|_2^2.$$
Conclusion: the number of iterations to reach $f(x^k) - f^\star \le \epsilon$ is $O(1/\epsilon)$.
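The $\|x^0 - x^\star\|_2^2 / (2kt)$ bound can be verified numerically on a small quadratic. This is an illustrative check, not from the notes; the diagonal test problem is an assumption.

```python
import numpy as np

# f(x) = (1/2) x^T A x has gradient A x, Lipschitz with L = lambda_max(A);
# use the fixed step t = 1/L, for which the O(1/k) bound applies
A = np.diag([1.0, 10.0])
L = 10.0
t = 1.0 / L
x = np.array([5.0, 1.0])
x0 = x.copy()
fstar = 0.0                      # minimizer is x* = 0, with f* = 0
traj = []
for k in range(1, 201):
    x = x - t * (A @ x)
    traj.append(0.5 * x @ A @ x)

# the analysis gives f(x^k) - f* <= ||x0 - x*||^2 / (2 k t)
bound_holds = all(fk - fstar <= np.dot(x0, x0) / (2 * k * t) + 1e-12
                  for k, fk in enumerate(traj, start=1))
```

On this example the actual decrease is much faster than $O(1/k)$ (the iterates contract geometrically), which is consistent: the sublinear bound is worst-case over the whole assumption class.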

3.1.3. Analysis for strongly convex functions

A faster convergence rate holds under the additional assumption of strong convexity.

Analysis for exact line search: recall from the quadratic upper bound that
$$f(x - t\nabla f(x)) \le f(x) - t\left(1 - \frac{Lt}{2}\right)\|\nabla f(x)\|_2^2.$$
Since $x^+$ is obtained by exact line search, $x^+ = \operatorname{argmin}_t f(x - t\nabla f(x))$, it does at least as well as the step $t = 1/L$:
$$f(x^+) \le f\left(x - \frac{1}{L}\nabla f(x)\right) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|_2^2.$$
Subtract $f^\star$ from both sides:
$$f(x^+) - f^\star \le f(x) - f^\star - \frac{1}{2L}\|\nabla f(x)\|_2^2.$$
Now use strong convexity, $f(x) - f^\star \le \frac{1}{2\mu}\|\nabla f(x)\|_2^2$:
$$f(x^+) - f^\star \le \left(1 - \frac{\mu}{L}\right)\left(f(x) - f^\star\right).$$

Therefore
$$f(x^k) - f^\star \le \left(1 - \frac{\mu}{L}\right)^k \left(f(x^0) - f^\star\right).$$
Conclusion: the number of iterations to reach $f(x^k) - f^\star \le \epsilon$ is
$$\frac{\log\left((f(x^0) - f^\star)/\epsilon\right)}{-\log(1 - \mu/L)} \approx \frac{L}{\mu}\log\left(\frac{f(x^0) - f^\star}{\epsilon}\right),$$
roughly proportional to the condition number $L/\mu$ when it is large.

A quadratic example:
$$f(x) = \frac{1}{2}\left(x_1^2 + \gamma x_2^2\right), \quad \gamma > 1.$$
With exact line search, starting at $x^0 = (\gamma, 1)$:
$$f(x^k) = \left(\frac{\gamma - 1}{\gamma + 1}\right)^{2k} f(x^0), \qquad \|x^k - x^\star\|_2 = \left(\frac{\gamma - 1}{\gamma + 1}\right)^k \|x^0 - x^\star\|_2.$$
If $\gamma = 10^4$ and $k = 100$, then $\left(\frac{\gamma - 1}{\gamma + 1}\right)^k \approx 0.98$. The gradient method can be very slow, and its speed depends heavily on the scaling of the problem.
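The closed-form rate on this example can be reproduced numerically. For a quadratic $f(x) = \frac{1}{2}x^\top A x$, the exact line-search step along $-g$ has the closed form $t = g^\top g / (g^\top A g)$; the script below (illustrative, not from the notes) checks that $f(x^k)$ matches the predicted $((\gamma-1)/(\gamma+1))^{2k} f(x^0)$.

```python
import numpy as np

gamma = 10.0
A = np.diag([1.0, gamma])
f = lambda x: 0.5 * (x[0]**2 + gamma * x[1]**2)

x = np.array([gamma, 1.0])   # the special starting point x^0 = (gamma, 1)
f0 = f(x)
for _ in range(20):
    g = A @ x
    t = (g @ g) / (g @ A @ g)    # exact line-search step for a quadratic
    x = x - t * g

ratio = (gamma - 1) / (gamma + 1)
predicted = ratio**(2 * 20) * f0  # f(x^k) = ((gamma-1)/(gamma+1))^{2k} f(x^0)
```

The agreement is exact (up to floating-point error) because, from this starting point, the iterates zig-zag between two fixed directions with a constant contraction factor.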

(Figure: as $\gamma$ grows the contours become more eccentric, and the gradient method is slow.)

A non-quadratic example:
$$f(x_1, x_2) = e^{x_1 + 3x_2 - 0.1} + e^{x_1 - 3x_2 - 0.1} + e^{-x_1 - 0.1}.$$
Its gradient is
$$\nabla f(x_1, x_2) = \begin{pmatrix} e^{x_1 + 3x_2 - 0.1} + e^{x_1 - 3x_2 - 0.1} - e^{-x_1 - 0.1} \\ 3e^{x_1 + 3x_2 - 0.1} - 3e^{x_1 - 3x_2 - 0.1} \end{pmatrix}.$$
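A sketch (not from the notes) implementing this $f$ and its gradient, checked against central finite differences at an arbitrary test point; the point and step size are assumptions.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return (np.exp(x1 + 3*x2 - 0.1) + np.exp(x1 - 3*x2 - 0.1)
            + np.exp(-x1 - 0.1))

def grad_f(x):
    x1, x2 = x
    e1 = np.exp(x1 + 3*x2 - 0.1)
    e2 = np.exp(x1 - 3*x2 - 0.1)
    e3 = np.exp(-x1 - 0.1)
    return np.array([e1 + e2 - e3, 3*e1 - 3*e2])

# sanity check against central finite differences at a test point
x = np.array([0.3, -0.2])
h = 1e-6
fd = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(2)])
```

Such a finite-difference check is a standard way to catch sign errors like the $-e^{-x_1-0.1}$ term in the first gradient component.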

Run the gradient method with backtracking line search ($\alpha = 0.1$, $\beta = 0.7$). (Figure: upper panel, $f(x^k) - f^\star$; lower panel, $\|\nabla f(x^k)\|_2$.)

Convergence rates:
- sublinear rate: $r_k \le c/k^p$
- linear rate: $r_k \le c(1 - q)^k$
- quadratic rate: $r_{k+1} \le c r_k^2$

Here $r_k$ can be $f(x^k) - f^\star$, $\|x^k - x^\star\|_2$, or $\|\nabla f(x^k)\|_2$; $c$ is some constant.

3.2. Newton's method

Assume $f(x)$ is twice continuously differentiable and convex. Given $x^k$, use a quadratic function to approximate $f(x)$ locally via Taylor's expansion:
$$f(x^k + p^k) \approx f(x^k) + \nabla f(x^k)^\top p^k + \frac{1}{2}(p^k)^\top \nabla^2 f(x^k)\, p^k.$$
Choose $p^k$ so that this quadratic approximation is minimized:
$$\nabla f(x^k) + \nabla^2 f(x^k)\, p^k = 0.$$
(Pure) Newton method:
$$x^{k+1} = x^k - \nabla^2 f(x^k)^{-1} \nabla f(x^k).$$
Damped Newton method:
$$x^{k+1} = x^k - t_k \nabla^2 f(x^k)^{-1} \nabla f(x^k).$$
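The damped update can be sketched directly from the formula. This is an illustrative implementation (the name `damped_newton`, the fixed damping, and the test function are assumptions; in practice $t_k$ would come from a line search):

```python
import numpy as np

def damped_newton(grad, hess, x0, t=1.0, iters=50):
    """Damped Newton: x^{k+1} = x^k - t * H(x^k)^{-1} grad(x^k).
    A fixed damping t is used here for simplicity (t = 1 recovers
    the pure Newton step)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # solve the Newton system instead of forming the inverse
        x = x - t * np.linalg.solve(hess(x), grad(x))
    return x

# minimize f(x) = x1^4 + x2^2 (minimizer at the origin)
grad = lambda x: np.array([4*x[0]**3, 2*x[1]])
hess = lambda x: np.array([[12*x[0]**2, 0.0], [0.0, 2.0]])
sol = damped_newton(grad, hess, [1.0, 1.0])
```

Note the quadratic coordinate is solved exactly in one step, while the quartic coordinate only contracts by a factor $2/3$ per iteration: Newton's fast local rate relies on the Hessian being well-behaved near the minimizer.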

Advantages: fast convergence, affine invariance.

Affine invariance means the method is independent of linear changes of coordinates: for example, the Newton iterates for $\tilde{f}(y) = f(Ty)$ with starting point $y^0 = T^{-1}x^0$ are $y^k = T^{-1}x^k$.

Disadvantages: it requires second derivatives, and solving the linear system can be too expensive for large-scale applications.

Quasi-Newton methods

It is too expensive to compute $\nabla^2 f(x^k)$, so we use some matrix $B_k$ to replace it. Use a quadratic model to approximate $f(x)$ locally at $x^k$:
$$m_k(p) = f(x^k) + \nabla f(x^k)^\top p + \frac{1}{2} p^\top B_k p,$$
where $B_k$ is an $n \times n$ symmetric positive definite matrix. Note that the function value and gradient of this model at $p = 0$ match $f(x^k)$ and $\nabla f(x^k)$, respectively. Minimizing this quadratic approximation gives
$$p^k = -B_k^{-1}\nabla f(x^k),$$
and we update the iterate via
$$x^{k+1} = x^k + \alpha_k p^k.$$

How do we compute $B_k$? When we are at $x^{k+1}$, we want to construct
$$m_{k+1}(p) = f(x^{k+1}) + \nabla f(x^{k+1})^\top p + \frac{1}{2} p^\top B_{k+1} p.$$
We want the gradient of $m_{k+1}$ to match the gradient of $f$ at both $x^k$ and $x^{k+1}$:
$$\nabla m_{k+1}(0) = \nabla f(x^{k+1})$$
and
$$\nabla m_{k+1}(-\alpha_k p^k) = \nabla f(x^{k+1}) - \alpha_k B_{k+1} p^k = \nabla f(x^k),$$
so that
$$B_{k+1}\,\alpha_k p^k = \nabla f(x^{k+1}) - \nabla f(x^k).$$
To simplify the notation, define
$$s^k = x^{k+1} - x^k = \alpha_k p^k, \qquad y^k = \nabla f(x^{k+1}) - \nabla f(x^k).$$

Then we get
$$B_{k+1} s^k = y^k,$$
which is called the secant equation.

DFP and BFGS

To compute $B_{k+1}$, we solve
$$\min_B \|B - B_k\| \quad \text{s.t.} \quad B = B^\top, \; B s^k = y^k.$$
This gives the DFP updating formula (originally given by Davidon in 1959, and subsequently studied by Fletcher and Powell):
$$B_{k+1} = \left(I - \rho_k y^k (s^k)^\top\right) B_k \left(I - \rho_k s^k (y^k)^\top\right) + \rho_k y^k (y^k)^\top, \qquad \rho_k = \frac{1}{(y^k)^\top s^k}.$$
The other way to compute $B_{k+1}$: denote its inverse by $H_{k+1}$ and solve
$$\min_H \|H - H_k\| \quad \text{s.t.} \quad H = H^\top, \; H y^k = s^k.$$

This gives the BFGS updating formula (proposed independently by Broyden, Fletcher, Goldfarb, and Shanno):
$$H_{k+1} = \left(I - \rho_k s^k (y^k)^\top\right) H_k \left(I - \rho_k y^k (s^k)^\top\right) + \rho_k s^k (s^k)^\top,$$
or, by the Sherman-Morrison-Woodbury formula,
$$B_{k+1} = B_k - \frac{B_k s^k (s^k)^\top B_k}{(s^k)^\top B_k s^k} + \frac{y^k (y^k)^\top}{(y^k)^\top s^k}.$$
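Both rank-two updates are one-liners, and the secant equations they are built to satisfy ($H_{k+1} y^k = s^k$ for BFGS, $B_{k+1} s^k = y^k$ for DFP) can be verified directly. The function names and random test vectors below are illustrative assumptions.

```python
import numpy as np

def bfgs_update_H(H, s, y):
    """BFGS update of the inverse-Hessian approximation H_{k+1}."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
            + rho * np.outer(s, s))

def dfp_update_B(B, s, y):
    """DFP update of the Hessian approximation B_{k+1}."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(y, s)) @ B @ (I - rho * np.outer(s, y))
            + rho * np.outer(y, y))

rng = np.random.default_rng(1)
s, y = rng.standard_normal(3), rng.standard_normal(3)
if y @ s < 0:            # ensure the curvature condition y^T s > 0
    y = -y
H1 = bfgs_update_H(np.eye(3), s, y)
B1 = dfp_update_B(np.eye(3), s, y)
```

Symmetry is preserved by construction, and when $(y^k)^\top s^k > 0$ the updates also preserve positive definiteness, which keeps $p^k$ a descent direction.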

The complete BFGS method

Given an initial point $x^0$ and $B_0 \succ 0$, repeat for $k = 0, 1, 2, \ldots$ until a stopping criterion is satisfied:
- compute the quasi-Newton direction $p^k = -B_k^{-1}\nabla f(x^k)$
- determine the step size $t_k$ (via backtracking line search)
- update $x^{k+1} = x^k + t_k p^k$ and compute $\nabla f(x^{k+1})$
- compute $B_{k+1}$
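The steps above can be assembled into a working loop. This is a sketch, not the notes' reference implementation: it works with the inverse-Hessian form $H_k$ (so no linear solve is needed), uses Armijo backtracking, and skips the update when the curvature condition fails; the constants and test problem are assumptions.

```python
import numpy as np

def bfgs(f, grad, x0, iters=50, tol=1e-8):
    """BFGS with backtracking line search, in the inverse-Hessian form."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                        # H_0 = I
    g = grad(x)
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                       # quasi-Newton direction
        t = 1.0                          # Armijo backtracking from t = 1
        while f(x + t*p) > f(x) + 1e-4 * t * (g @ p):
            t *= 0.5
        s = t * p
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        if y @ s > 1e-12:                # update only if y^T s > 0
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            H = ((I - rho*np.outer(s, y)) @ H @ (I - rho*np.outer(y, s))
                 + rho*np.outer(s, s))
        x, g = x_new, g_new
    return x

# test on the strongly convex quadratic f(x) = (1/2) x^T A x - b^T x,
# whose minimizer solves A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
xmin = bfgs(lambda x: 0.5 * x @ A @ x - b @ x,
            lambda x: A @ x - b, [0.0, 0.0])
```

Trying the unit step first matters: near the solution the unit step is accepted, which is what allows the superlinear local rate described on the next slide.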

Convergence of the BFGS method

Global convergence: if $f$ is strongly convex, then BFGS with backtracking line search converges to the optimum for any $x^0$ and $B_0 \succ 0$.

Local convergence: if $f$ is strongly convex and $\nabla^2 f(x)$ is Lipschitz continuous, then local convergence is superlinear: for sufficiently large $k$,
$$\|x^{k+1} - x^\star\|_2 \le c_k \|x^k - x^\star\|_2,$$
where $c_k \to 0$.