Chapter 3. Gradient Method
3.1. Gradient method

Classical gradient method: to minimize a differentiable convex function $f$,
    $\min_{x \in \mathbb{R}^n} f(x)$
The algorithm: choose $x_0$ and repeat
    $x_{k+1} = x_k - t_k \nabla f(x_k), \quad k = 0, 1, 2, \ldots$
Note that $p_k := -\nabla f(x_k)$ is a descent direction. (We call $p_k$ a descent direction if $p_k^T \nabla f(x_k) < 0$.)
Question: how to choose the step size $t_k$?
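Before discussing step-size rules, here is a minimal Python sketch of the iteration itself with a constant step size; the quadratic test function, the choice $t = 1/L$, and the NumPy dependency are illustrative assumptions, not part of the slides.

import numpy as np

# toy quadratic f(x) = 1/2 x^T A x - b^T x, with gradient A x - b
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad_f(x):
    return A @ x - b

x = np.zeros(2)                        # x_0
t = 1.0 / np.linalg.norm(A, 2)         # fixed step t = 1/L (L = largest eigenvalue of A)
for k in range(100):
    x = x - t * grad_f(x)              # x_{k+1} = x_k - t * grad f(x_k)

print(x, np.linalg.solve(A, b))        # iterate vs. exact minimizer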
Step size rules
- exact line search: $t_k = \arg\min_t f(x_k - t \nabla f(x_k))$
- fixed: $t_k$ constant
- backtracking line search (most practical)

Backtracking line search: initialize $t_k$ at some $\hat{t} > 0$ (for example, $\hat{t} = 1$), and repeat $t_k := \beta t_k$ until
    $f(x - t_k \nabla f(x)) < f(x) - \alpha t_k \|\nabla f(x)\|_2^2$
This is called the Armijo condition. Two parameters: $0 < \beta < 1$ and $0 < \alpha \le 0.5$.
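A small Python sketch of this backtracking rule (the function name, default parameters, and NumPy usage are my own illustrative choices):

import numpy as np

def backtracking_step(f, grad_f, x, t_hat=1.0, alpha=0.25, beta=0.5):
    """Shrink t until the Armijo condition f(x - t*g) < f(x) - alpha*t*||g||^2 holds."""
    g = grad_f(x)
    t = t_hat
    while f(x - t * g) >= f(x) - alpha * t * np.dot(g, g):
        t = beta * t
    return t

One step of the gradient method then reads x = x - backtracking_step(f, grad_f, x) * grad_f(x).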
General line search

At $x_k$, compute a descent direction $p_k$; what step size $t_k$ should we take to move
    $x_{k+1} = x_k + t_k p_k$?
- exact line search: $\min_t f(x_k + t p_k)$
- Armijo condition:
    $f(x_k + t p_k) \le f(x_k) + c_1 t \nabla f(x_k)^T p_k$
- Wolfe conditions:
    $f(x_k + t p_k) \le f(x_k) + c_1 t \nabla f(x_k)^T p_k$
    $\nabla f(x_k + t p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k$
  with $0 < c_1 < c_2 < 1$.
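To make the two Wolfe inequalities concrete, here is a sketch of a predicate that tests whether a candidate step size $t$ satisfies them; the function name and the default values of $c_1$, $c_2$ are assumptions for illustration.

import numpy as np

def satisfies_wolfe(f, grad_f, x, p, t, c1=1e-4, c2=0.9):
    """Check the sufficient-decrease (Armijo) and curvature conditions at step size t."""
    g = grad_f(x)
    armijo = f(x + t * p) <= f(x) + c1 * t * np.dot(g, p)
    curvature = np.dot(grad_f(x + t * p), p) >= c2 * np.dot(g, p)
    return armijo and curvature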
3.1.1. Analysis of gradient method

    $x_{k+1} = x_k - t_k \nabla f(x_k), \quad k = 0, 1, 2, \ldots$
with fixed step size or backtracking line search.

Assumptions:
- $f$ is convex and differentiable with $\operatorname{dom} f = \mathbb{R}^n$
- $\nabla f(x)$ is Lipschitz continuous with parameter $L > 0$
- the optimal value $f^\star = \inf_x f(x)$ is finite and attained at $x^\star$
Lipschitz continuity: a function $h$ is called Lipschitz continuous with Lipschitz constant $L$ if
    $\|h(x) - h(y)\| \le L \|x - y\|, \quad \forall x, y \in \operatorname{dom} h.$

If $\nabla f(x)$ is Lipschitz continuous with parameter $L > 0$, then (quadratic upper bound)
    $f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2}\|x - y\|_2^2, \quad \forall x, y \in \operatorname{dom} f$
If $\operatorname{dom} f = \mathbb{R}^n$ and $f$ has a minimizer $x^\star$, then
    $\frac{1}{2L}\|\nabla f(x)\|_2^2 \le f(x) - f(x^\star) \le \frac{L}{2}\|x - x^\star\|_2^2$
Strongly convex

$f$ is strongly convex with parameter $\mu > 0$ if $f(x) - \frac{\mu}{2}\|x\|_2^2$ is convex.
First-order condition:
    $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}\|x - y\|_2^2, \quad \forall x, y \in \operatorname{dom} f$
Second-order condition:
    $\nabla^2 f(x) \succeq \mu I, \quad \forall x \in \operatorname{dom} f$
If $\operatorname{dom} f = \mathbb{R}^n$, then $f$ has a minimizer $x^\star$, and
    $\frac{\mu}{2}\|x - x^\star\|_2^2 \le f(x) - f(x^\star) \le \frac{1}{2\mu}\|\nabla f(x)\|_2^2$
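For a quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$ with $A \succ 0$, the Hessian is constant, so $L$ and $\mu$ are simply the largest and smallest eigenvalues of $A$. The following sketch (the particular $A$, $b$ and random test point are illustrative) checks the two sandwich bounds from this and the previous slide at a random point:

import numpy as np

A = np.array([[4.0, 1.0], [1.0, 2.0]])      # symmetric positive definite
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

eigs = np.linalg.eigvalsh(A)
mu, L = eigs[0], eigs[-1]                   # strong convexity and Lipschitz parameters

x_star = np.linalg.solve(A, b)              # minimizer
x = np.random.randn(2)
gap = f(x) - f(x_star)

# bounds from the Lipschitz and strong convexity slides
assert np.dot(grad_f(x), grad_f(x)) / (2 * L) <= gap <= L / 2 * np.dot(x - x_star, x - x_star)
assert mu / 2 * np.dot(x - x_star, x - x_star) <= gap <= np.dot(grad_f(x), grad_f(x)) / (2 * mu)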
3.1.2. Analysis of constant step size

Recall the quadratic upper bound:
    $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|_2^2$
Plug in $y = x - t\nabla f(x)$ to obtain
    $f(x - t\nabla f(x)) \le f(x) - t\left(1 - \frac{Lt}{2}\right)\|\nabla f(x)\|_2^2$
Let $x^+ = x - t\nabla f(x)$ and assume $0 < t \le 1/L$; since then $1 - Lt/2 \ge 1/2$,
    $f(x^+) \le f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2$
    $\le f^\star + \langle \nabla f(x), x - x^\star \rangle - \frac{t}{2}\|\nabla f(x)\|_2^2$   (by convexity, $f(x) \le f^\star + \langle \nabla f(x), x - x^\star \rangle$)
    $= f^\star + \frac{1}{2t}\left(\|x - x^\star\|_2^2 - \|x - x^\star - t\nabla f(x)\|_2^2\right)$   (complete the square)
    $= f^\star + \frac{1}{2t}\left(\|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2\right)$
Take $x = x_{i-1}$, $x^+ = x_i$, $t_i = t$, and sum the bounds for $i = 1, \ldots, k$:
    $\sum_{i=1}^k \left(f(x_i) - f^\star\right) \le \frac{1}{2t}\sum_{i=1}^k \left(\|x_{i-1} - x^\star\|_2^2 - \|x_i - x^\star\|_2^2\right) = \frac{1}{2t}\left(\|x_0 - x^\star\|_2^2 - \|x_k - x^\star\|_2^2\right) \le \frac{1}{2t}\|x_0 - x^\star\|_2^2$
Since $f(x_i)$ is non-increasing,
    $f(x_k) - f^\star \le \frac{1}{k}\sum_{i=1}^k \left(f(x_i) - f^\star\right) \le \frac{1}{2kt}\|x_0 - x^\star\|_2^2$
Conclusion: the number of iterations to reach $f(x_k) - f^\star \le \epsilon$ is $O(1/\epsilon)$.
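The $\frac{1}{2kt}\|x_0 - x^\star\|_2^2$ bound can be checked numerically. A sketch under illustrative assumptions (a diagonal quadratic, $t = 1/L$, a particular starting point) that asserts the bound along the iterates:

import numpy as np

A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

L = np.max(np.linalg.eigvalsh(A))
t = 1.0 / L
x_star = np.linalg.solve(A, b)
f_star = f(x_star)

x = np.array([5.0, -5.0])                     # x_0
r0_sq = np.dot(x - x_star, x - x_star)        # ||x_0 - x*||^2
for k in range(1, 201):
    x = x - t * grad_f(x)
    assert f(x) - f_star <= r0_sq / (2 * k * t) + 1e-12   # O(1/k) bound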
3.1.3. Analysis for strongly convex functions

Faster convergence rate with the additional assumption of strong convexity.

Analysis for exact line search: recall from the quadratic upper bound
    $f(x - t\nabla f(x)) \le f(x) - t\left(1 - \frac{Lt}{2}\right)\|\nabla f(x)\|_2^2$
With exact line search, $x^+$ minimizes $f(x - t\nabla f(x))$ over $t$, so
    $f(x^+) \le f\left(x - \tfrac{1}{L}\nabla f(x)\right) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|_2^2$
Subtract $f^\star$ from both sides:
    $f(x^+) - f^\star \le f(x) - f^\star - \frac{1}{2L}\|\nabla f(x)\|_2^2$
Now use strong convexity: $f(x) - f^\star \le \frac{1}{2\mu}\|\nabla f(x)\|_2^2$, which gives
    $f(x^+) - f^\star \le \left(1 - \frac{\mu}{L}\right)\left(f(x) - f^\star\right)$
Therefore
    $f(x_k) - f^\star \le \left(1 - \frac{\mu}{L}\right)^k \left(f(x_0) - f^\star\right)$
Conclusion: the number of iterations to reach $f(x_k) - f^\star \le \epsilon$ is
    $\frac{\log\left((f(x_0) - f^\star)/\epsilon\right)}{-\log(1 - \mu/L)} \le \frac{L}{\mu}\log\left(\frac{f(x_0) - f^\star}{\epsilon}\right)$
which is roughly proportional to the condition number $L/\mu$ when it is large.
A quadratic example

    $f(x) = \frac{1}{2}\left(x_1^2 + \gamma x_2^2\right), \quad \gamma > 1$
With exact line search, starting at $x_0 = (\gamma, 1)$:
    $f(x_k) = \left(\frac{\gamma - 1}{\gamma + 1}\right)^{2k} f(x_0), \qquad \|x_k - x^\star\|_2 = \left(\frac{\gamma - 1}{\gamma + 1}\right)^k \|x_0 - x^\star\|_2$
If $\gamma = 10^4$ and $k = 100$, then $\left(\frac{\gamma - 1}{\gamma + 1}\right)^k \approx 0.98$.
The gradient method can be very slow, and is very much dependent on the scaling.
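A small Python check of this closed-form rate (the value $\gamma = 10$ and the number of iterations are illustrative). For a quadratic $f(x) = \frac{1}{2}x^T Q x$, the exact line search step is $t = (g^T g)/(g^T Q g)$ with $g = \nabla f(x)$:

import numpy as np

gamma = 10.0
Q = np.diag([1.0, gamma])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

x = np.array([gamma, 1.0])                 # x_0 = (gamma, 1)
f0 = f(x)
rho = (gamma - 1.0) / (gamma + 1.0)
for k in range(1, 21):
    g = grad_f(x)
    t = (g @ g) / (g @ Q @ g)              # exact line search step for a quadratic
    x = x - t * g
    assert np.isclose(f(x), rho**(2 * k) * f0)   # matches ((gamma-1)/(gamma+1))^{2k} f(x_0)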
(Figure: contours of the quadratic example with the gradient method iterates. The more eccentric the contours are, the slower the gradient method is.)
A non-quadratic example

    $f(x_1, x_2) = \exp(x_1 + 3x_2 - 0.1) + \exp(x_1 - 3x_2 - 0.1) + \exp(-x_1 - 0.1)$
Its gradient is
    $\nabla f(x_1, x_2) = \begin{pmatrix} \exp(x_1 + 3x_2 - 0.1) + \exp(x_1 - 3x_2 - 0.1) - \exp(-x_1 - 0.1) \\ 3\exp(x_1 + 3x_2 - 0.1) - 3\exp(x_1 - 3x_2 - 0.1) \end{pmatrix}$
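A Python sketch of this example, run with the backtracking line search parameters used in the figure on the next slide; the starting point and iteration count are illustrative assumptions (the slides do not specify them).

import numpy as np

def f(x):
    return (np.exp(x[0] + 3 * x[1] - 0.1)
            + np.exp(x[0] - 3 * x[1] - 0.1)
            + np.exp(-x[0] - 0.1))

def grad_f(x):
    e1 = np.exp(x[0] + 3 * x[1] - 0.1)
    e2 = np.exp(x[0] - 3 * x[1] - 0.1)
    e3 = np.exp(-x[0] - 0.1)
    return np.array([e1 + e2 - e3, 3 * e1 - 3 * e2])

# gradient method with backtracking line search (alpha = 0.1, beta = 0.7)
alpha, beta = 0.1, 0.7
x = np.array([-1.0, 1.0])          # illustrative starting point
for k in range(50):
    g = grad_f(x)
    t = 1.0
    while f(x - t * g) >= f(x) - alpha * t * np.dot(g, g):
        t *= beta
    x = x - t * g
print(x, f(x), np.linalg.norm(grad_f(x)))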
(Figure: run of the gradient method with backtracking line search ($\alpha = 0.1$, $\beta = 0.7$) on this example; upper plot: $f(x_k) - f^\star$ versus $k$, lower plot: $\|\nabla f(x_k)\|_2$ versus $k$.)
Convergence rate
- sublinear rate: $r_k \le c/k^p$
- linear rate: $r_k \le c(1 - q)^k$
- quadratic rate: $r_{k+1} \le c\, r_k^2$
Here $r_k$ can be $f(x_k) - f^\star$, $\|x_k - x^\star\|_2$, or $\|\nabla f(x_k)\|_2$; $c$ is some constant.
3.2. Newton's Method

Assume $f(x)$ is twice continuously differentiable and convex.

Given $x_k$, use a quadratic function to approximate $f(x)$ locally via Taylor's expansion:
    $f(x_k + p_k) \approx f(x_k) + \nabla f(x_k)^T p_k + \frac{1}{2} p_k^T \nabla^2 f(x_k) p_k$
Choose $p_k$ such that this quadratic approximation is minimized:
    $\nabla f(x_k) + \nabla^2 f(x_k) p_k = 0$
(Pure) Newton method:
    $x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$
Damped Newton method:
    $x_{k+1} = x_k - t_k \nabla^2 f(x_k)^{-1} \nabla f(x_k)$
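A Python sketch of the damped Newton iteration with the backtracking rule from earlier; the function name, defaults, and stopping rule are illustrative assumptions.

import numpy as np

def damped_newton(f, grad_f, hess_f, x, alpha=0.25, beta=0.5, tol=1e-8, max_iter=50):
    """Damped Newton: x <- x + t*p with p = -hess^{-1} grad and backtracking on t."""
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        p = np.linalg.solve(hess_f(x), -g)        # Newton direction
        t = 1.0
        while f(x + t * p) > f(x) + alpha * t * np.dot(g, p):   # Armijo condition
            t *= beta
        x = x + t * p
    return x

On a strictly convex quadratic the quadratic model is exact, so the method converges in one step.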
Advantages: fast convergence, affine invariance.
Affine invariant means: independent of linear changes of coordinates. For example, the Newton iterates for $\bar{f}(y) = f(Ty)$ (with $T$ nonsingular) and starting point $y_0 = T^{-1} x_0$ are $y_k = T^{-1} x_k$.
Disadvantages: requires second derivatives; the solution of a linear system can be too expensive for large-scale applications.
Quasi-Newton Method

It is too expensive to compute $\nabla^2 f(x_k)$, so use some $B_k$ to replace it.

Use a quadratic model to approximate $f(x)$ locally at $x_k$:
    $m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p$
Here $B_k$ is an $n \times n$ symmetric positive definite matrix. Note that the function value and gradient of this model at $p = 0$ match $f(x_k)$ and $\nabla f(x_k)$, respectively.
By minimizing this quadratic approximation, we obtain
    $p_k = -B_k^{-1} \nabla f(x_k)$
Then we update the iterate via
    $x_{k+1} = x_k + \alpha_k p_k$
How to compute $B_k$?

When we are at $x_{k+1}$, we want to construct
    $m_{k+1}(p) = f(x_{k+1}) + \nabla f(x_{k+1})^T p + \frac{1}{2} p^T B_{k+1} p$
We want the gradient of $m_{k+1}$ to match the gradient of $f$ at $x_k$ and $x_{k+1}$:
    $\nabla m_{k+1}(0) = \nabla f(x_{k+1})$ (which holds by construction), and
    $\nabla m_{k+1}(-\alpha_k p_k) = \nabla f(x_{k+1}) - \alpha_k B_{k+1} p_k = \nabla f(x_k)$
and so we have
    $B_{k+1}\, \alpha_k p_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
To simplify the notation, define
    $s_k = x_{k+1} - x_k = \alpha_k p_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$
Then we get
    $B_{k+1} s_k = y_k$
This is called the secant equation.
DFP and BFGS

To compute $B_{k+1}$, we solve
    $\min_B \|B - B_k\| \quad \text{s.t.} \quad B = B^T, \; B s_k = y_k$
This gives the following DFP updating formula (originally given by Davidon in 1959, and subsequently studied by Fletcher and Powell):
    $B_{k+1} = \left(I - \rho_k y_k s_k^T\right) B_k \left(I - \rho_k s_k y_k^T\right) + \rho_k y_k y_k^T$
with $\rho_k = 1/(y_k^T s_k)$.

The other way to compute $B_{k+1}$: denote its inverse by $H_{k+1}$ and solve the following problem
    $\min_H \|H - H_k\| \quad \text{s.t.} \quad H = H^T, \; H y_k = s_k$
This gives the following BFGS updating formula (proposed by Broyden, Fletcher, Goldfarb and Shanno, independently):
    $H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T$
or, by the Sherman-Morrison-Woodbury formula,
    $B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$
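A short Python sketch of the BFGS update of the inverse approximation $H_k$, together with a numerical check that the updated matrix satisfies the secant equation; the random data (and the function name) are purely illustrative.

import numpy as np

def bfgs_update_inverse(H, s, y):
    """BFGS update of the inverse Hessian approximation H using s_k and y_k."""
    rho = 1.0 / np.dot(y, s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

# illustrative check of the secant equation: H_{k+1} y_k = s_k, i.e. B_{k+1} s_k = y_k
np.random.seed(0)
n = 4
H = np.eye(n)
s = np.random.randn(n)
y = s + 0.1 * np.random.randn(n)            # chosen so that y^T s > 0 (curvature condition)
H_new = bfgs_update_inverse(H, s, y)
print(np.allclose(H_new @ y, s))            # True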
The complete BFGS method

Given an initial point $x_0$ and $B_0 \succ 0$, repeat for $k = 0, 1, 2, \ldots$ until a stopping criterion is satisfied:
- compute the quasi-Newton direction $p_k = -B_k^{-1} \nabla f(x_k)$
- determine a step size $t_k$ (via backtracking line search)
- update $x_{k+1} = x_k + t_k p_k$ and compute $\nabla f(x_{k+1})$
- compute $B_{k+1}$
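The loop above, written as a hedged Python sketch. It maintains the inverse approximation $H_k$ directly (so no linear system is solved per iteration) and uses Armijo backtracking; the defaults and the curvature safeguard are my own illustrative choices.

import numpy as np

def bfgs(f, grad_f, x0, tol=1e-8, max_iter=200, alpha=1e-4, beta=0.5):
    """BFGS with backtracking line search, maintaining the inverse approximation H_k."""
    n = len(x0)
    x, H = np.array(x0, dtype=float), np.eye(n)
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        p = -H @ g                                    # quasi-Newton direction
        t = 1.0
        while f(x + t * p) > f(x) + alpha * t * np.dot(g, p):   # Armijo backtracking
            t *= beta
        s = t * p
        x_new = x + s
        g_new = grad_f(x_new)
        y = g_new - g
        if np.dot(y, s) > 1e-12:                      # skip the update if the curvature condition fails
            rho = 1.0 / np.dot(y, s)
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

Working with $H_k$ rather than $B_k$ is the usual implementation choice, since the search direction is then just a matrix-vector product.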
Convergence of BFGS method

Global convergence: if $f$ is strongly convex, then BFGS with backtracking line search converges to the optimum for any $x_0$ and any $B_0 \succ 0$.

Local convergence: if $f$ is strongly convex and $\nabla^2 f(x)$ is Lipschitz continuous, then the local convergence is superlinear: for sufficiently large $k$,
    $\|x_{k+1} - x^\star\|_2 \le c_k \|x_k - x^\star\|_2$
where $c_k \to 0$.