ISyE 6663 (Winter 2017) Nonlinear Optimization


ISyE 6663 (Winter 2017) Nonlinear Optimization
Prof. R. D. C. Monteiro, Georgia Institute of Technology
LaTeXer: W. Kong
Last Revision: May 3, 2017

Table of Contents

1 Review of Concepts
  1.1 Unconstrained Optimization
  1.2 Convexity
  1.3 Projection onto Convex Sets
2 Algorithms
  2.1 Steepest Descent
  2.2 Projected Gradient Method
  2.3 Gradient-type Methods
  2.4 Inexact Line Search
  2.5 Newton's Method
  2.6 Conjugate Gradient Method
  2.7 General Conjugate Gradient Method
  2.8 Nesterov's Method
  2.9 Quasi-Newton Methods
3 Constrained Optimization
  3.1 General NLPs
  3.2 Augmented Lagrangian Methods
  3.3 Global Method
  3.4 Barrier Methods
  3.5 Interior Point Methods
  3.6 Interior Point Algorithm
4 Duality
  4.1 Dual Function
  4.2 Existence of G.M.s
  4.3 Augmented Lagrangian Method vs. Duality

These notes are currently a work in progress, and as such may be incomplete or contain errors.

ACKNOWLEDGMENTS: Special thanks to Michael Baker and his LaTeX-formatted notes. They were the inspiration for the structure of these notes.

Abstract: The purpose of these notes is to provide the reader with a secondary reference to the material covered in ISyE 6663.

1 Review of Concepts

1.1 Unconstrained Optimization

Definition 1.1. For a set $S \subseteq \mathbb{R}^n$ and $f : S \to \mathbb{R}$, an optimization problem can be formulated as
$$\min(\max)\ f(x) \quad \text{s.t.}\quad x \in S,$$
which we will call the standard minimization problem.

Example 1.1. Here are some basic examples:
(1) $S = \{x \in \mathbb{R}^n : Ax \le b\}$
(2) $S = \{x \in \mathbb{R}^n : h(x) = 0,\ g(x) \le 0,\ x \in X\}$ where $h : \mathbb{R}^n \to \mathbb{R}^m$, $g : \mathbb{R}^n \to \mathbb{R}^r$, and $X \subseteq \mathbb{R}^n$ is simple.

Remark 1.1. If $S = \mathbb{R}^n$ then the problem is unconstrained; otherwise, if $S \subsetneq \mathbb{R}^n$, it is constrained.

Example 1.2. (Least Squares) For errors $e_i = y_i - \hat f(t_i)$ and trial function $\hat f(t) = x_1 + x_2 \exp(-x_3 t)$, a constrained optimization problem is
$$\min\ \sum_{i=1}^m e_i^2 = \sum_{i=1}^m \left(y_i - x_1 - x_2 e^{-x_3 t_i}\right)^2 \quad \text{s.t.}\quad x_3 \ge 0,\ x_1 + x_2 = 1.$$

Definition 1.2. $x^* \in S$ is a [strict] global minimum (optimal solution) of the standard minimization problem if $f(x) \ge [>]\ f(x^*)$ [for $x \ne x^*$] for all $x \in S$. Similar definitions follow for maximization problems.

Notation. We will denote:
$$B(x^*; \varepsilon) = \{x \in \mathbb{R}^n : \|x - x^*\| < \varepsilon\}, \qquad \bar B(x^*; \varepsilon) = \{x \in \mathbb{R}^n : \|x - x^*\| \le \varepsilon\}.$$

Definition 1.3. $x^* \in S$ is a [strict] local minimum of the standard minimization problem if $\exists \varepsilon > 0$ such that $f(x) \ge [>]\ f(x^*)$ [for $x \ne x^*$] for all $x \in S \cap B(x^*, \varepsilon)$.

Definition 1.4. $S$ is compact iff $S$ is closed and bounded.

Theorem 1.1. (Weierstrass) If $S$ is compact and $f$ is continuous on $S$, then the standard minimization problem has a global minimum.

Corollary 1.1. If $S$ is closed, $f$ is continuous on $S$, and $\lim_{\|x\|\to\infty,\,x \in S} f(x) = \infty$, then the standard minimization problem has a global minimum. The condition $\lim_{\|x\|\to\infty,\,x \in S} f(x) = -\infty$ is instead required for maximization problems. Note that:
$$\lim_{\|x\|\to\infty,\,x\in S} f(x) = \infty \iff \left(\forall M \ge 0,\ \exists r \ge 0\ \text{s.t.}\ \|x\| > r,\ x \in S \implies f(x) > M\right)$$
$$\iff \left(\forall M \ge 0,\ \exists r \ge 0\ \text{s.t.}\ f(x) \le M,\ x \in S \implies \|x\| \le r\right) \iff \{x \in S : f(x) \le M\} \subseteq \bar B(0, r)$$
$$\iff \{x \in S : f(x) \le M\}\ \text{is bounded for every}\ M \ge 0.$$

Proof. (Sketch) Pick $x_0 \in S$, set $M = f(x_0)$, and remark that $\{x \in S : f(x) \le f(x_0)\}$ is compact. The rest follows from Weierstrass.

Definition 1.5. Given $S = \mathbb{R}^n$, $f : \mathbb{R}^n \to \mathbb{R}$, $\bar x \in \mathbb{R}^n$, the gradient of $f$ at $\bar x$ is
$$\nabla f(\bar x) = \left( \frac{\partial f}{\partial x_1}(\bar x), \ldots, \frac{\partial f}{\partial x_n}(\bar x) \right)^T \in \mathbb{R}^n.$$

Remark 1.2. (Interpretations) (1) On the level set $\{x \in \mathbb{R}^n : f(x) = f(\bar x)\}$, the gradient lies perpendicular to the set and points in the direction of steepest ascent. (2) The graph of $f$ is $\{(x, f(x)) \in \mathbb{R}^{n+1} : x \in \mathbb{R}^n\}$ and the gradient defines a linear approximation at $\bar x$ given by $t = f(\bar x) + \langle \nabla f(\bar x), x - \bar x \rangle$. In particular,
$$\begin{pmatrix} \nabla f(\bar x) \\ -1 \end{pmatrix}^T \begin{pmatrix} x - \bar x \\ t - f(\bar x) \end{pmatrix} = 0.$$

Proposition 1.1. If $x^*$ is a local minimum of the standard optimization problem and $f$ is differentiable at $x^*$, then $\nabla f(x^*) = 0$.

Proof. Let $d \in \mathbb{R}^n$ be given. For every $t > 0$ sufficiently small, $0 \le \frac{f(x^* + td) - f(x^*)}{t}$; as $t \to 0^+$ we get $\langle \nabla f(x^*), d \rangle = \nabla f(x^*)^T d \ge 0$ for any $d \in \mathbb{R}^n$. This is only the case when $\nabla f(x^*) = 0$, as taking $d = -\nabla f(x^*)$ gives $-\|\nabla f(x^*)\|^2 \ge 0$.

Definition 1.6. $H \in \mathbb{R}^{n \times n}$ is positive semi-definite if $x^T H x \ge 0$ for all $x \in \mathbb{R}^n$ (notation $H \succeq 0$). It is positive definite if $x^T H x > 0$ for all $x \in \mathbb{R}^n$, $x \ne 0$.

Fact 1.1. If $f$ is twice continuously differentiable at $x$, then
$$\nabla^2 f(x) = f''(x) = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j}(x) \right]_{ij}$$
is symmetric.

Proposition 1.2. If $x^*$ is a local minimum of the standard optimization problem and $f$ is twice continuously differentiable at $x^*$ (or $f \in C^2(\mathbb{R}^n)$), then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq 0$.

Proof. Note that
$$f(x + h) = f(x) + \nabla f(x)^T h + \tfrac{1}{2} h^T \nabla^2 f(x) h + r(h)\|h\|^2, \qquad \lim_{h\to 0} r(h) = 0,$$
or equivalently $f(x + h) = f(x) + \nabla f(x)^T h + \tfrac{1}{2} h^T \nabla^2 f(x + th) h$ for some $t \in (0, 1)$. The case $\nabla f(x^*) = 0$ has already been shown, so let $H = \nabla^2 f(x^*)$ and $d \in \mathbb{R}^n$. We want to show that $d^T H d \ge 0$. For $t > 0$ sufficiently small, from the first expansion,
$$0 \le f(x^* + td) - f(x^*) = t \underbrace{\nabla f(x^*)^T d}_{=0} + \tfrac{1}{2} t^2 d^T H d + t^2 r(td)\|d\|^2.$$
Dividing by $t^2$ gives us $0 \le \tfrac{1}{2} d^T H d + r(td)\|d\|^2$. Taking $t \to 0$ yields $0 \le d^T H d$.

Example 1.3. The converse is generally not true. Consider the case $f(x) = x^3$, which satisfies the first- and second-order conditions at $x = 0$ but does not have a local minimum at that point.

Theorem 1.2. Assume that $f \in C^2$ and $x^* \in \mathbb{R}^n$ is such that $\nabla f(x^*) = 0$, $\nabla^2 f(x^*) \succ 0$. Then $x^*$ is a strict local minimizer of the standard minimization problem.

Proof. Let $H = \nabla^2 f(x^*)$. By the Weierstrass theorem, choose $\alpha > 0$ such that $u^T H u \ge \alpha$ for all $u \in \mathbb{R}^n$ with $\|u\| = 1$. We have
$$f(x^* + h) - f(x^*) = \tfrac{1}{2} h^T H h + r(h)\|h\|^2, \qquad \lim_{h\to 0} r(h) = 0,$$

which implies that $\exists \delta > 0$ such that $\|h\| \le \delta \implies |r(h)| \le \frac{\alpha}{4}$, and hence, if $\|h\| \le \delta$, we have
$$f(x^* + h) - f(x^*) = \|h\|^2 \left[ \tfrac{1}{2} \left( \tfrac{h}{\|h\|} \right)^T H \left( \tfrac{h}{\|h\|} \right) + r(h) \right] \ge \|h\|^2 \left[ \tfrac{\alpha}{2} - \tfrac{\alpha}{4} \right] = \tfrac{1}{4} \alpha \|h\|^2.$$
Hence, if $0 < \|h\| \le \delta$ then $f(x^* + h) - f(x^*) > 0$. So $x^*$ is a strict local minimum of the standard minimization problem.

Example 1.4. The above condition is not necessary and the converse is not true. Consider the function $f(x) = x^4$ at $x^* = 0$.

1.2 Convexity

Definition 1.7. $C \subseteq \mathbb{R}^n$ is a convex set if $(x, y) := \{tx + (1-t)y : t \in (0, 1)\} \subseteq C$ for all $x, y \in C$. Here are some properties:
1) If $\{C_i\}_{i \in I}$ is a collection of convex sets in $\mathbb{R}^n$ then $\cap_{i \in I} C_i$ is convex.
2) If $T : \mathbb{R}^n \to \mathbb{R}^m$ is affine and $C \subseteq \mathbb{R}^n$, $D \subseteq \mathbb{R}^m$ are convex, then $T(C)$, $T^{-1}(D)$ are convex.
3) $C_i \subseteq \mathbb{R}^{n_i}$ convex for $i = 1, 2, \ldots, r$ implies that $C_1 \times \cdots \times C_r$ is convex.
4) $C_i \subseteq \mathbb{R}^n$ convex for $i = 1, 2, \ldots, r$ implies that $C_1 + \cdots + C_r$ (Minkowski sum) is convex.
5) $C \subseteq \mathbb{R}^n$ convex, $\alpha \in \mathbb{R}$ implies that $\alpha C$ is convex.
6) $C$ convex implies that $\mathrm{cl}(C)$ and $\mathrm{int}(C)$ are convex.

Example 1.5. Here are some examples:
1) Hyperplane: for $0 \ne u \in \mathbb{R}^n$, $\beta \in \mathbb{R}$, define $H = H(u, \beta) := \{x \in \mathbb{R}^n : u^T x = \beta\}$.
2) Half-spaces: $H^+ = \{x \in \mathbb{R}^n : u^T x \ge \beta\}$, $H^- = \{x \in \mathbb{R}^n : u^T x \le \beta\}$.
3) Polyhedra: $\cap_i H_i^-$.

Proposition 1.3. If $C$ is convex then $\sum_{i=1}^n \alpha_i x_i \in C$ for $x_i \in C$, $\alpha_i \ge 0$, $i = 1, 2, \ldots, n$, and $\sum_{i=1}^n \alpha_i = 1$.

Proof. (Can be done by induction, using convexity.)

Definition 1.8. Let $C \subseteq \mathbb{R}^n$ be a convex set and $f(x)$ a function defined on $C$. A function $f$ is convex on $C$ if for all $x, y \in C$, $t \in (0, 1)$ we have
$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y).$$
The function $f$ is strictly convex if the above inequality holds strictly whenever $x \ne y$.

Definition 1.9. $f$ is $\beta$-strongly convex ($\beta > 0$) on $C$ if
$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y) - \frac{\beta}{2} t(1-t)\|x - y\|^2$$
for all $x, y \in C$ and $t \in (0, 1)$.

Proposition 1.4. $f$ is $\beta$-strongly convex iff $f - \frac{\beta}{2}\|\cdot\|^2$ is convex.

Proof. (Left as an exercise.)

Proposition 1.5. If $f$ is convex on $C$ then for every $\alpha \in \mathbb{R}$, the sets
$$\{x \in C : f(x) < \alpha\}, \qquad \{x \in C : f(x) \le \alpha\}$$
are convex.

Proposition 1.6. The following are equivalent:
(1) $f$ is convex on $C$;
(2) $\{(x, t) \in C \times \mathbb{R} : f(x) \le t\}$ is convex;
(3) $\{(x, t) \in C \times \mathbb{R} : f(x) < t\}$ is convex.

Proof. Left as an exercise to the reader.

Proposition 1.7. (Jensen's inequality) Assume $f$ is convex on $C$. If $x_1, \ldots, x_r \in C$ with $\sum_{i=1}^r \alpha_i = 1$, $\alpha_1, \ldots, \alpha_r \ge 0$, then $f\left( \sum_{i=1}^r \alpha_i x_i \right) \le \sum_{i=1}^r \alpha_i f(x_i)$.

Proof. Since $f$ is convex, the set $U = \{(x, t) \in C \times \mathbb{R} : f(x) \le t\}$ is convex. Clearly $(x_i, f(x_i))^T \in U$ for all $i = 1, \ldots, r$. So $\sum_i \alpha_i (x_i, f(x_i))^T = \left( \sum_i \alpha_i x_i, \sum_i \alpha_i f(x_i) \right)^T \in U$, and hence $f\left( \sum_{i=1}^r \alpha_i x_i \right) \le \sum_{i=1}^r \alpha_i f(x_i)$ from the definition of $U$.

Notation 1. For $\Omega \subseteq \mathbb{R}^n$ we write $f \in C^1(\Omega)$ if $f$ is continuously differentiable at every $x \in \Omega$.

Proposition 1.8. For $\Omega \subseteq \mathbb{R}^n$ convex and $f \in C^1(\Omega)$, the following are equivalent:
(a) $f$ is (strictly) convex on $\Omega$;
(b) $f(y) \ge (>)\ f(x) + \nabla f(x)^T (y - x)$, $\forall x, y \in \Omega$ $(x \ne y)$;
(c) $[\nabla f(y) - \nabla f(x)]^T (y - x) \ge (>)\ 0$, $\forall x, y \in \Omega$ $(x \ne y)$.

Proof. [(a) $\implies$ (b)] Let $x, y \in \Omega$ be given. For all $t \in (0, 1)$, we have
$$f(ty + (1-t)x) \le t f(y) + (1-t) f(x) \iff f(x + t(y - x)) \le f(x) + t[f(y) - f(x)]$$
$$\implies \frac{f(x + t(y - x)) - f(x)}{t} \le f(y) - f(x) \ \xrightarrow{\,t \to 0\,}\ \nabla f(x)^T (y - x) \le f(y) - f(x).$$

[(b) $\implies$ (a)] Let $x, y \in \Omega$ be given and $z_t = ty + (1-t)x \in \Omega$. Then by (b),
$$f(y) \ge f(z_t) + \nabla f(z_t)^T (y - z_t) \quad (1), \qquad f(x) \ge f(z_t) + \nabla f(z_t)^T (x - z_t) \quad (2),$$
and $t\cdot(1) + (1-t)\cdot(2)$ yields
$$t f(y) + (1-t) f(x) \ge f(z_t) + \nabla f(z_t)^T (ty + (1-t)x - z_t) = f(z_t) = f(ty + (1-t)x).$$

[(b) $\implies$ (c)] Just add the two inequalities:
$$f(y) \ge f(x) + \nabla f(x)^T (y - x), \qquad f(x) \ge f(y) + \nabla f(y)^T (x - y).$$

[(c) $\implies$ (b)] For some $t \in (0, 1)$, $f(y) - f(x) = \nabla f(z_t)^T (y - x)$

where $z_t = x + t(y - x)$. Since $z_t - x = t(y - x)$ we have
$$f(y) - f(x) - \nabla f(x)^T (y - x) = [\nabla f(z_t) - \nabla f(x)]^T (y - x) = \frac{1}{t}[\nabla f(z_t) - \nabla f(x)]^T (z_t - x) \ge 0.$$

Proposition 1.9. $f$ is $\beta$-strongly convex ($\beta > 0$) iff $f - \frac{\beta}{2}\|\cdot\|^2$ is convex.

Proposition 1.10. For $\Omega \subseteq \mathbb{R}^n$ convex, $f \in C^1(\Omega)$ and $\beta \in \mathbb{R}$, the following are equivalent:
(a) $f - \frac{\beta}{2}\|\cdot\|^2$ is convex;
(b) $\forall x, y \in \Omega$, $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2}\|y - x\|^2$;
(c) $\forall x, y \in \Omega$, $[\nabla f(y) - \nabla f(x)]^T (y - x) \ge \beta\|y - x\|^2$.

Proof. Define $\tilde f := f - \frac{\beta}{2}\|\cdot\|^2$ and remark that $\tilde f$ is convex and, from a previous result, $\nabla \tilde f(x) = \nabla f(x) - \beta x$,
$$\tilde f(y) \ge \tilde f(x) + \nabla \tilde f(x)^T (y - x), \ \forall x, y \in \Omega \quad (1), \qquad [\nabla \tilde f(y) - \nabla \tilde f(x)]^T (y - x) \ge 0, \ \forall x, y \in \Omega \quad (2),$$
and (1) is equivalent to (b) and (2) is equivalent to (c).

Proposition 1.11. For $\Omega \subseteq \mathbb{R}^n$ convex, $f \in C^1(\Omega)$ and $M \in \mathbb{R}$, the following are equivalent:
(a) $\frac{M}{2}\|\cdot\|^2 - f$ is convex;
(b) $\forall x, y \in \Omega$, $f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{M}{2}\|y - x\|^2$;
(c) $\forall x, y \in \Omega$, $[\nabla f(y) - \nabla f(x)]^T (y - x) \le M\|y - x\|^2$.

Proof. Apply the previous proposition with $\beta = -M$ and $f$ replaced by $-f$.

Proposition 1.12. Assume $\Omega \subseteq \mathbb{R}^n$ is convex and $f \in C^1(\Omega)$ is convex on $\Omega$. Then the following are equivalent for $\bar x \in \Omega$:
(a) $\bar x$ is a global minimum of $f$ on $\Omega$;
(b) $\bar x$ is a local minimum of $f$ on $\Omega$;
(c) $\nabla f(\bar x)^T (x - \bar x) \ge 0$, $\forall x \in \Omega$, i.e. $f'(\bar x; x - \bar x) \ge 0$ where $f'(\bar x; x - \bar x) = \nabla f(\bar x)^T (x - \bar x)$.

Proof. [(a) $\implies$ (b)] Obvious. [(b) $\implies$ (c)] Since $\bar x$ is a local minimum, $f(\bar x + t(x - \bar x)) - f(\bar x) \ge 0$ for $t > 0$ sufficiently small. If we divide by $t$ and take $t \to 0$ then $\nabla f(\bar x)^T (x - \bar x) \ge 0$. [(c) $\implies$ (a)] Let $x \in \Omega$ be given. By (c), $\nabla f(\bar x)^T (x - \bar x) \ge 0$. By the convexity of $f$, we have
$$f(x) \ge f(\bar x) + \nabla f(\bar x)^T (x - \bar x) \ge f(\bar x),$$
and so $\bar x$ is a global minimum.

Remark 1.3. If $\bar x \in \mathrm{int}(\Omega)$ then (c) $\iff \nabla f(\bar x) = 0$.

Proof. ($\implies$) Assume $\nabla f(\bar x) \ne 0$. We know $\exists \epsilon > 0$ such that $\bar B(\bar x; \epsilon) \subseteq \Omega$ and $\nabla f(\bar x)^T (x - \bar x) \ge 0$ for all $x \in \bar B(\bar x; \epsilon)$ from (c). Now $x := \bar x - \epsilon \frac{\nabla f(\bar x)}{\|\nabla f(\bar x)\|} \in \bar B(\bar x; \epsilon)$, and substituting this into the above inequality yields $0 \le -\epsilon\|\nabla f(\bar x)\| < 0$, leading to a contradiction.

Proposition 1.13. If $\Omega \subseteq \mathbb{R}^n$ is convex and $f \in C^1(\Omega)$ is strictly convex on $\Omega$, then $f$ has at most one global minimum.

Proof. Assume $\bar x \in \Omega$ is a global minimum of $\min\{f(x) : x \in \Omega\}$. Let $x \ne \bar x$, $x \in \Omega$ be given. We have $f(x) > f(\bar x) + \nabla f(\bar x)^T (x - \bar x)$ and $\nabla f(\bar x)^T (x - \bar x) \ge 0$ from a previous result. So $f(x) > f(\bar x)$, and thus $\bar x$ is the only global minimum.

Proposition 1.14. If $\Omega$ is convex, $f \in C^1(\Omega)$, and $\nabla f(\cdot)$ is $L$-Lipschitz continuous on $\Omega$ (i.e. $\|\nabla f(y) - \nabla f(x)\| \le L\|x - y\|$ for all $x, y \in \Omega$), then
$$-\frac{L}{2}\|x - y\|^2 \le f(y) - [f(x) + \nabla f(x)^T (y - x)] \le \frac{L}{2}\|x - y\|^2,$$
$$-L\|x - y\|^2 \le [\nabla f(y) - \nabla f(x)]^T (y - x) \le L\|x - y\|^2.$$
The second set of inequalities is proven by Cauchy-Schwarz.

Proposition 1.15. Assume $\Omega \subseteq \mathbb{R}^n$ is closed and convex, and $f \in C^1(\Omega)$ is $\beta$-strongly convex. Then $f^* = \inf\{f(x) : x \in \Omega\}$ has a unique optimal solution $x^*$ and
$$f(x) \ge f^* + \frac{\beta}{2}\|x - x^*\|^2, \quad \forall x \in \Omega.$$

Proof. Take $x_0 \in \Omega$. Since $f$ is $\beta$-strongly convex, we have $f(x) \ge f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{\beta}{2}\|x - x_0\|^2$ for all $x \in \Omega$. Hence, as $\|x\| \to \infty$, $x \in \Omega$, we will have $f(x) \to \infty$. Thus $\inf\{f(x) : x \in \Omega\}$ has a unique optimal solution $x^*$. Hence $\nabla f(x^*)^T (x - x^*) \ge 0$ for all $x \in \Omega$ and $f(x) \ge f^* + \frac{\beta}{2}\|x - x^*\|^2$, $\forall x \in \Omega$.

1.3 Projection onto Convex Sets

Definition 1.10. For $\Omega \subseteq \mathbb{R}^n$ closed and convex, $x \in \mathbb{R}^n$, we define
$$\Pi_\Omega(x) = \mathrm{argmin}_y \{\|y - x\| : y \in \Omega\} = \mathrm{argmin}_y \left\{ \tfrac{1}{2}\|y - x\|^2 : y \in \Omega \right\}$$
as the projection of $x$ onto $\Omega$. The latter definition is useful because the $\tfrac{1}{2}\|\cdot\|^2$ function is strongly convex.

Corollary 1.2. Using the previous definition and $\langle x, y \rangle = x^T y$:
(1) $\Pi_\Omega$ is well-defined;
(2) $\bar x = \Pi_\Omega(x) \iff \langle y - \bar x, x - \bar x \rangle \le 0$, $\forall y \in \Omega$;
(3) $\langle x_1 - x_2, \Pi_\Omega(x_1) - \Pi_\Omega(x_2) \rangle \ge \|\Pi_\Omega(x_1) - \Pi_\Omega(x_2)\|^2$ and hence $\|x_1 - x_2\| \ge \|\Pi_\Omega(x_1) - \Pi_\Omega(x_2)\|$ for all $x_1, x_2$. That is, $\Pi_\Omega$ is non-expansive. (The formal proof follows below.)
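Before the proof, here is a minimal numerical sketch of Definition 1.10 and the non-expansiveness in Corollary 1.2(3), using two sets whose projections have closed forms (a box and a Euclidean ball); the function names and test data are illustrative, not from the lecture.

```python
import numpy as np

def proj_box(x, lo, hi):
    """Euclidean projection onto the box {y : lo <= y <= hi} (componentwise clip)."""
    return np.clip(x, lo, hi)

def proj_ball(x, center, radius):
    """Euclidean projection onto the closed ball B(center; radius)."""
    d = x - center
    n = np.linalg.norm(d)
    return x.copy() if n <= radius else center + radius * d / n

# Non-expansiveness check: ||Pi(x1) - Pi(x2)|| <= ||x1 - x2||.
x1, x2 = np.array([2.0, 0.5, -3.0]), np.array([-1.0, 4.0, 0.2])
p1, p2 = proj_ball(x1, np.zeros(3), 1.0), proj_ball(x2, np.zeros(3), 1.0)
assert np.linalg.norm(p1 - p2) <= np.linalg.norm(x1 - x2) + 1e-12
```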

Proof. (1) is obvious. For (2), let $f(y) = \tfrac{1}{2}\|y - x\|^2$. Then
$$\bar x = \Pi_\Omega(x) \iff \bar x \in \mathrm{argmin}_y\{f(y) : y \in \Omega\} \iff \nabla f(\bar x)^T (y - \bar x) \ge 0, \ \forall y \in \Omega \iff (\bar x - x)^T (y - \bar x) \ge 0, \ \forall y \in \Omega.$$
For (3), define $\bar x_i = \Pi_\Omega(x_i)$, $i = 1, 2$. We have
$$(x_1 - \bar x_1)^T (\bar x_2 - \bar x_1) \le 0, \qquad (x_2 - \bar x_2)^T (\bar x_1 - \bar x_2) \le 0,$$
and adding the two above inequalities yields
$$[(x_1 - x_2) - (\bar x_1 - \bar x_2)]^T (\bar x_2 - \bar x_1) \le 0 \implies \|\bar x_1 - \bar x_2\|^2 \le (x_1 - x_2)^T (\bar x_1 - \bar x_2) \le \|x_1 - x_2\|\|\bar x_1 - \bar x_2\|.$$

Remark 1.4. If $\Omega$ is closed convex, $\bar x \in \Omega$, and we define the normal cone of $\bar x$ as $N_\Omega(\bar x) = \{n \in \mathbb{R}^n : n^T (y - \bar x) \le 0, \ \forall y \in \Omega\}$, then the second condition of the previous corollary says $0 \in \bar x - x + N_\Omega(\bar x) \iff \bar x = \Pi_\Omega(x)$.

Remark 1.5. If $f$ is convex [I'm assuming we need this], then the problem $\min_y \{f(y) : y \in \Omega\}$ is equivalent to $0 \in \nabla f(x^*) + N_\Omega(x^*)$. This follows from the fact that the optimality condition for the problem is $\nabla f(x^*)^T (y - x^*) \ge 0$, $\forall y \in \Omega \iff -\nabla f(x^*) \in N_\Omega(x^*)$.

Proposition 1.16. Assume $\Omega \subseteq \mathbb{R}^n$ is convex and $f \in C^2(\Omega)$. Then:
(a) $\nabla^2 f(x) \succeq 0$, $\forall x \in \Omega \implies f$ is convex on $\Omega$;
(b) $f$ is convex on $\Omega$ and $\mathrm{int}\,\Omega \ne \emptyset \implies \nabla^2 f(x) \succeq 0$, $\forall x \in \Omega$;
(c) $\nabla^2 f(x) \succ 0$, $\forall x \in \Omega \implies f$ is strictly convex on $\Omega$.

Proof. (a) Let $x, y \in \Omega$. We will show $f(y) \ge f(x) + \nabla f(x)^T (y - x)$. We have
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}(y - x)^T \nabla^2 f(\xi)(y - x)$$
for some $\xi = x + t(y - x)$ and $t \in (0, 1)$. Clearly $\xi \in \Omega$ and hence $\nabla^2 f(\xi) \succeq 0$. So $d^T \nabla^2 f(\xi) d \ge 0$, $\forall d \in \mathbb{R}^n$, and the result follows.
(b) By contradiction, assume $\exists x \in \Omega$ such that $\nabla^2 f(x) \not\succeq 0$. Without loss of generality, we may assume that $x \in \mathrm{int}\,\Omega$ from the fact that $\Omega \subseteq \mathrm{cl}(\mathrm{int}(\Omega))$. From our assumption, we know $\lambda_{\min}[\nabla^2 f(x)] < 0$ and $\exists d \in \mathbb{R}^n$ with $d^T \nabla^2 f(x) d < 0$. By continuity, $\exists \epsilon > 0$ such that $d^T \nabla^2 f(y) d < 0$, $\forall y \in B(x, \epsilon)$. Take $\bar x = x + \epsilon' d$ with $\epsilon' > 0$ small enough that $\bar x \in B(x, \epsilon)$. Then
$$f(\bar x) = f(x) + \nabla f(x)^T (\bar x - x) + \tfrac{1}{2}(\bar x - x)^T \nabla^2 f(y)(\bar x - x)$$
for $y = x + t(\bar x - x) \in B(x, \epsilon)$ and $t \in (0, 1)$, and hence $f(\bar x) < f(x) + \nabla f(x)^T (\bar x - x)$, contradicting convexity.
(c) Same as (a) except we use strictness.

Corollary 1.3. Assume $\Omega \subseteq \mathbb{R}^n$ is convex and $f \in C^2(\Omega)$. For $m, M \in \mathbb{R}$, we have
$$mI \preceq \nabla^2 f(x) \preceq MI \iff f(\cdot) - \tfrac{m}{2}\|\cdot\|^2 \ \text{and}\ \tfrac{M}{2}\|\cdot\|^2 - f(\cdot)\ \text{are convex}$$
$$\iff \frac{m}{2}\|y - x\|^2 \le f(y) - [f(x) + \nabla f(x)^T (y - x)] \le \frac{M}{2}\|y - x\|^2$$
$$\iff m\|y - x\|^2 \le [\nabla f(y) - \nabla f(x)]^T (y - x) \le M\|y - x\|^2.$$

2 Algorithms

Definition 2.1. $d \in \mathbb{R}^n$ is a descent direction at $x$ if $\exists \delta > 0$ such that $\forall t \in (0, \delta)$ we have $f(x + td) < f(x)$.

Lemma 2.1. If $\nabla f(x)^T d < 0$ then $d$ is a descent direction at $x$.

Example 2.1. We may select $d = -\nabla f(x)$ or $d = -D\nabla f(x)$ where $D \succ 0$, as long as $\nabla f(x) \ne 0$.

Definition 2.2. A line search method is an algorithm with an update of the form $x_{k+1} = x_k + \alpha_k d_k$, where $d_k$ is a descent direction at $x_k$ and $\alpha_k$ is a positive step size, e.g. $\alpha_k \overset{?}{=} \mathrm{argmin}_{t \in [0, \bar\alpha]}\{f(x_k + t d_k) : t > 0\}$.

Definition 2.3. The trust region method has the following principle: given $x_k \in \mathbb{R}^n$, we approximate $f(x_k + p) \approx m_k(p)$ where $m_k(p)$ is a simple function (e.g. $f(x_k) + \nabla f(x_k)^T p$) and solve $p_k = \mathrm{argmin}_{p \in T_k} \{m_k(p)\}$ (e.g. $T_k = \bar B(0, \delta_k)$). If $f(x_k + p_k)$ is close to $m_k(p_k)$ then we iterate $x_{k+1} = x_k + p_k$. Otherwise, we reject $x_k + p_k$, set $x_{k+1} = x_k$, and shrink $T_k$. Closeness can be defined with
$$\rho_k = \frac{m_k(0) - f(x_k + p_k)}{m_k(0) - m_k(p_k)} = \frac{f(x_k) - f(x_k + p_k)}{f(x_k) - m_k(p_k)},$$
where $\rho_k \approx 1$ implies that our estimate is close.

2.1 Steepest Descent

Definition 2.4. For a function $f \in C^1(\mathbb{R}^n)$ which has $L$-Lipschitz continuous gradient, the steepest descent with fixed step size method is: given $x_0 \in \mathbb{R}^n$ and $\theta \in (0, 2)$, update
$$x_k = x_{k-1} - \frac{\theta}{L}\nabla f(x_{k-1}), \qquad k \leftarrow k + 1.$$

Proposition 2.1. Assume that $f(x_k) \ge f^*$ in the steepest descent method. Then for all $k \ge 1$ we have
$$\min_{1 \le i \le k} \|\nabla f(x_{i-1})\| \le \left[ \frac{2L\,(f(x_0) - f^*)}{k\,\theta(2 - \theta)} \right]^{1/2}.$$

Proof. For all $i \ge 1$, using our update step, we have
$$f(x_i) - f(x_{i-1}) \le \nabla f(x_{i-1})^T (x_i - x_{i-1}) + \frac{L}{2}\|x_i - x_{i-1}\|^2 = -\frac{\theta}{L}\|\nabla f(x_{i-1})\|^2 + \frac{\theta^2}{2L}\|\nabla f(x_{i-1})\|^2 = -\frac{\theta}{L}\left(1 - \frac{\theta}{2}\right)\|\nabla f(x_{i-1})\|^2.$$
So $f(x_{i-1}) - f(x_i) \ge \frac{\theta(2-\theta)}{2L}\|\nabla f(x_{i-1})\|^2$, and summing for $i = 1, 2, \ldots, k$ we get
$$f(x_0) - f^* \ge f(x_0) - f(x_k) \ge \frac{\theta(2-\theta)}{2L}\sum_{i=1}^k \|\nabla f(x_{i-1})\|^2 \ge \frac{k\,\theta(2-\theta)}{2L}\min_{i=1,2,\ldots,k}\|\nabla f(x_{i-1})\|^2.$$
The result follows after a simple re-arrangement.

Definition 2.5. For $\Omega \subseteq \mathbb{R}^n$ convex and $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient on $\Omega$, the projected gradient method is: given $x_0 \in \mathbb{R}^n$ and $\theta \in (0, 2)$, update
$$x_k = \mathrm{argmin}_x\left\{ \ell_f(x; x_{k-1}) + \frac{L}{2\theta}\|x - x_{k-1}\|^2 : x \in \Omega \right\}, \qquad k \leftarrow k + 1,$$
where $\ell_f(x; x_{k-1}) = f(x_{k-1}) + \nabla f(x_{k-1})^T (x - x_{k-1})$.

Lemma 2.2. For all $k \ge 1$, under the projected gradient scheme, we have
$$0 \in \nabla f(x_{k-1}) + N_\Omega(x_k) + \frac{L}{\theta}(x_k - x_{k-1}).$$

Proof. Define $\varphi_k(x) = \ell_f(x; x_{k-1}) + \frac{L}{2\theta}\|x - x_{k-1}\|^2$. We first know that $\nabla_x \ell_f(x; x_{k-1}) = \nabla f(x_{k-1})$ and $\nabla_x\left[\frac{L}{2\theta}\|x - x_{k-1}\|^2\right] = \frac{L}{\theta}(x - x_{k-1})$, and $x_k$ is optimal iff $0 \in \nabla\varphi_k(x_k) + N_\Omega(x_k)$. Hence we must have $0 \in \nabla f(x_{k-1}) + N_\Omega(x_k) + \frac{L}{\theta}(x_k - x_{k-1})$.

Lemma 2.3. Let $\bar r_k = \frac{L}{\theta}(x_{k-1} - x_k)$ and $r_k = \bar r_k + \nabla f(x_k) - \nabla f(x_{k-1})$. Then $r_k \in \nabla f(x_k) + N_\Omega(x_k)$ and
$$\|r_k\| \le L\left(\frac{1}{\theta} + 1\right)\|x_k - x_{k-1}\|.$$

Proof. (Simple algebraic manipulation using the Lipschitz property.)

Remark 2.1. If we can show that $\liminf_{k\to\infty} \|r_k\| = 0$ then the optimality condition $0 \in \nabla f(\bar x) + N_\Omega(\bar x)$ is approached via $\{x_k\}$.

Lemma 2.4. We have
$$f(x_{k-1}) - f(x_k) \ge \frac{L}{\theta}\left(1 - \frac{\theta}{2}\right)\|x_k - x_{k-1}\|^2.$$

Proof. (Will be shown next class.)
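As a quick aside before the complexity bound, here is a minimal sketch of the fixed-step update from Definition 2.4 (the unconstrained case), assuming $L$ is known; the quadratic test problem at the bottom is illustrative data, not from the lecture.

```python
import numpy as np

def steepest_descent_fixed(grad, x0, L, theta=1.0, iters=100):
    """Steepest descent with fixed step theta/L, theta in (0, 2) (Definition 2.4)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - (theta / L) * grad(x)
    return x

# Illustrative problem: f(x) = 0.5 x^T Q x - b^T x, gradient Qx - b, L = lambda_max(Q).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(Q).max()
x = steepest_descent_fixed(lambda x: Q @ x - b, np.zeros(2), L)
print(x, np.linalg.solve(Q, b))  # the iterate approaches the true minimizer Q^{-1} b
```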

Proposition 2.2. Assume that $f(x_k) \ge f^*$ for all $k \ge 0$. Then for all $k \ge 1$ we have
$$\min_{1 \le i \le k}\|r_i\| \le \sqrt{\frac{f(x_0) - f^*}{k}}\,(\theta + 1)\left(\frac{2L}{\theta(2 - \theta)}\right)^{1/2}.$$

Proof. We have
$$f(x_0) - f^* \ge f(x_0) - f(x_k) = \sum_{i=1}^k (f(x_{i-1}) - f(x_i)) \ge \frac{L}{\theta}\left(1 - \frac{\theta}{2}\right)\sum_{i=1}^k \|x_i - x_{i-1}\|^2$$
$$\ge \frac{kL(2 - \theta)}{2\theta}\min_{1 \le i \le k}\|x_i - x_{i-1}\|^2 \ge \frac{kL(2 - \theta)}{2\theta}\cdot\frac{\theta^2}{L^2(1 + \theta)^2}\min_{1 \le i \le k}\|r_i\|^2.$$

2.2 Projected Gradient Method

Definition 2.6. For a set $\Omega \subseteq \mathbb{R}^n$ which is closed and convex and a function $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient, the linear approximation of $f$ is defined as $\ell_f(\bar x; x) := f(x) + \nabla f(x)^T(\bar x - x)$, where $\ell_f(x; x) = f(x)$ and $\nabla \ell_f(\bar x; x) = \nabla f(x)$. We have previously seen
$$|f(\bar x) - \ell_f(\bar x; x)| \le \frac{L}{2}\|\bar x - x\|^2, \quad \forall x, \bar x \in \Omega.$$

Definition 2.7. Given a set $\Omega \subseteq \mathbb{R}^n$ which is closed and convex, a function $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient, a point $x_0 \in \Omega$, and $\theta \in (0, 2)$, the projected gradient method is
$$x_k = \mathrm{argmin}_{x \in \Omega}\left\{ \ell_f(x; x_{k-1}) + \frac{L}{2\theta}\|x - x_{k-1}\|^2 \right\}, \qquad k \leftarrow k + 1. \quad (1)$$

Lemma 2.5. For all $k \ge 1$, we have
$$f(x_k) \le f(x_{k-1}) - \frac{L}{\theta}\left(1 - \frac{\theta}{2}\right)\|x_k - x_{k-1}\|^2.$$

Proof. By (1), since the objective of (1) is $(L/\theta)$-strongly convex with minimizer $x_k$,
$$\ell_f(x; x_{k-1}) + \frac{L}{2\theta}\|x - x_{k-1}\|^2 \ge \ell_f(x_k; x_{k-1}) + \frac{L}{2\theta}\|x_k - x_{k-1}\|^2 + \frac{L}{2\theta}\|x - x_k\|^2. \quad (2)$$
Taking $x = x_{k-1}$,
$$f(x_{k-1}) \ge \ell_f(x_k; x_{k-1}) + \frac{L}{\theta}\|x_k - x_{k-1}\|^2 = \ell_f(x_k; x_{k-1}) + \frac{L}{2}\|x_k - x_{k-1}\|^2 + \left(\frac{L}{\theta} - \frac{L}{2}\right)\|x_k - x_{k-1}\|^2 \ge f(x_k) + \frac{L}{\theta}\left(1 - \frac{\theta}{2}\right)\|x_k - x_{k-1}\|^2.$$
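Before the convergence bounds, a minimal sketch of the update in Definition 2.7: the subproblem in (1) is solved exactly by a gradient step of length $\theta/L$ followed by projection onto $\Omega$. The box constraint below is an illustrative choice of $\Omega$, not from the lecture.

```python
import numpy as np

def projected_gradient(grad, proj, x0, L, theta=1.0, iters=200):
    """Projected gradient (Definition 2.7): the prox subproblem
    argmin_{x in Omega} l_f(x; x_{k-1}) + (L/(2*theta))*||x - x_{k-1}||^2
    reduces to projecting the gradient step x_{k-1} - (theta/L)*grad(x_{k-1})."""
    x = x0.copy()
    for _ in range(iters):
        x = proj(x - (theta / L) * grad(x))
    return x

# Illustrative problem: minimize 0.5*||x - c||^2 over the box [0, 1]^2.
c = np.array([1.5, -0.3])
x = projected_gradient(lambda x: x - c, lambda y: np.clip(y, 0.0, 1.0), np.zeros(2), L=1.0)
print(x)  # approximately [1.0, 0.0], the projection of c onto the box
```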

Lemma 2.6. Given a set $\Omega \subseteq \mathbb{R}^n$ which is closed and convex, a convex function $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient, and the set of optimal solutions $\Omega^*$ for the optimization problem $\min f(x)$ s.t. $x \in \Omega$: for every $k \ge 1$ and $x^* \in \Omega^*$ we have
$$\frac{L}{2}\left(\|x^* - x_{k-1}\|^2 - \|x^* - x_k\|^2\right) \ge f(x_k) - f^*.$$

Proof. By (2), with $\theta = 1$ and $x = x^*$, we have
$$\underbrace{\ell_f(x^*; x_{k-1})}_{\le f(x^*)} + \frac{L}{2}\|x^* - x_{k-1}\|^2 \ge \underbrace{\ell_f(x_k; x_{k-1}) + \frac{L}{2}\|x_k - x_{k-1}\|^2}_{\ge f(x_k)} + \frac{L}{2}\|x^* - x_k\|^2$$
and the result follows after an algebraic re-arrangement.

Lemma 2.7. Under the previous lemma's assumptions, for all $k \ge 1$ and $x^* \in \Omega^*$, we have
$$\frac{L}{2}\left(\|x^* - x_0\|^2 - \|x^* - x_k\|^2\right) \ge \sum_{i=1}^k [f(x_i) - f^*] \ge k[f(x_k) - f^*].$$

Proof. (Easy exercise.)

Lemma 2.8. Under the previous lemma's assumptions, for all $k \ge 1$ and $x^* \in \Omega^*$, we have
$$\|x_k - x^*\| \le \|x_0 - x^*\|, \qquad f(x_k) - f^* \le \frac{L}{2k}\|x_0 - x^*\|^2.$$
Note that if $x^* = \Pi_{\Omega^*}(x_0)$ then $d_0 := \|x_0 - \Pi_{\Omega^*}(x_0)\|$ can be thought of as the distance of $x_0$ to $\Omega^*$, and
$$f(x_k) - f^* \le \frac{L d_0^2}{2k}.$$

Proof. (Follows from the previous lemma.)

Lemma 2.9. Define $r_k = \frac{L}{\theta}(x_{k-1} - x_k) + \nabla f(x_k) - \nabla f(x_{k-1})$. Then $r_k \in \nabla f(x_k) + N_\Omega(x_k)$, where if $r_k = 0$ then $x_k$ satisfies the optimality condition of $\min f(x)$ s.t. $x \in \Omega$.

Proof. Left as an exercise.

Definition 2.8. $\{a_k\}_{k=1}^\infty \subseteq \mathbb{R}$ converges geometrically if there exist $\gamma \ge 0$ and $\tau \in (0, 1)$ such that $|a_k| \le \gamma\tau^k$, $\forall k \ge 1$.

Note 1. Then $\lim_{k\to\infty}(a_k/[1/k^p]) = 0$ for $p > 0$, but the rate at which $a_k$ diminishes may be REALLY slow relative to $1/k^p$.

Lemma 2.10. Given a set $\Omega \subseteq \mathbb{R}^n$ which is closed and convex, a $\beta$-strongly convex function $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient, and the set of optimal solutions $\Omega^*$ for the optimization problem $\min f(x)$ s.t. $x \in \Omega$: for every $k \ge 1$ and $x^* \in \Omega^*$ we have
$$\frac{L}{2}\left(1 - \frac{\beta}{L}\right)^k d_0^2 \ge f(x_k) - f^*.$$

Proof. By (2), with $\theta = 1$ and $x = x^*$, we have
$$\underbrace{\ell_f(x^*; x_{k-1})}_{\le f(x^*) - \frac{\beta}{2}\|x^* - x_{k-1}\|^2} + \frac{L}{2}\|x^* - x_{k-1}\|^2 \ge \underbrace{\ell_f(x_k; x_{k-1}) + \frac{L}{2}\|x_k - x_{k-1}\|^2}_{\ge f(x_k)} + \frac{L}{2}\|x^* - x_k\|^2$$
and the result follows after an algebraic re-arrangement and iterating over $k$.

Exercise 2.1. Recall $r_k \in \nabla f(x_k) + N_\Omega(x_k)$ and $\|r_k\| \le L\left(\frac{1}{\theta} + 1\right)\|x_k - x_{k-1}\|$. For $\theta = 1$, we have $\|r_k\| \le 2L\|x_k - x_{k-1}\|$. Show that
$$\min_{i=1,\ldots,k}\|r_i\| = O\left(\frac{1}{\sqrt{k}}\right)\ \text{if}\ f\ \text{is convex}, \qquad \|r_k\| = O\left(\left(1 - \frac{\beta}{L}\right)^{k/2}\right)\ \text{if}\ f\ \text{is}\ \beta\text{-strongly convex}.$$

Remark 2.2. For a function $f(x) = \frac{1}{2}(x - x^*)^T Q(x - x^*) + \gamma$, we have $L = \lambda_{\max}(Q)$, $\beta = \lambda_{\min}(Q)$ and $\mathrm{cond}(Q) = \lambda_{\max}(Q)/\lambda_{\min}(Q)$, so the rate for $\|r_k\|$ is related to the inverse condition number of $Q$.

2.3 Gradient-type Methods

Problem 2.1. For standard minimization algorithms of the form $x_{k+1} = x_k + \alpha_k d_k$, where $\alpha_k$, $d_k$ are respective step sizes and descent directions, what conditions on $\{\alpha_k\}$, $\{d_k\}$ should we set to ensure convergence?

Remark 2.3. Assuming the gradient is still $L$-Lipschitz, we know
$$f(x') - f(x) - \nabla f(x)^T (x' - x) \le \frac{L}{2}\|x' - x\|^2 \implies f(x_k + \alpha d_k) \le f(x_k) + \alpha\nabla f(x_k)^T d_k + \frac{L\alpha^2}{2}\|d_k\|^2.$$
Take
$$\alpha_k = \mathrm{argmin}_{\alpha \in \mathbb{R}}\left\{ \alpha\nabla f(x_k)^T d_k + \frac{L\alpha^2}{2}\|d_k\|^2 \right\},$$
where at optimality we need
$$\nabla f(x_k)^T d_k + \alpha_k L\|d_k\|^2 = 0 \implies \alpha_k = -\frac{\nabla f(x_k)^T d_k}{L\|d_k\|^2} > 0.$$

Substituting this into the Lipschitz condition yields
$$f(x_k) - f(x_{k+1}) \ge \frac{(\nabla f(x_k)^T d_k)^2}{2L\|d_k\|^2} > 0.$$

Remark 2.4. Let $\epsilon_k = -\frac{\nabla f(x_k)^T d_k}{\|\nabla f(x_k)\|\|d_k\|}$, i.e. $\epsilon_k = \cos\theta_k$ where $\theta_k$ is the angle between $d_k$ and $-\nabla f(x_k)$. Then
$$f(x_k) - f(x_{k+1}) \ge \frac{\epsilon_k^2\|\nabla f(x_k)\|^2}{2L}$$
$$\implies f(x_0) - f^* \ge f(x_0) - f(x_k) \ge \sum_{i=0}^{k-1}[f(x_i) - f(x_{i+1})] \ge \frac{1}{2L}\left(\min_{i \le k-1}\|\nabla f(x_i)\|^2\right)\sum_{i=0}^{k-1}\epsilon_i^2$$
$$\implies \min_{i \le k-1}\|\nabla f(x_i)\|^2 \le \frac{2L(f(x_0) - f^*)}{\sum_{i=0}^{k-1}\epsilon_i^2}.$$
So if $\sum_{i=0}^\infty \epsilon_i^2 = \infty$ (e.g. $\epsilon_i \ge \epsilon$ for all $i$), then $\lim_{k\to\infty}\min_{i \le k}\|\nabla f(x_i)\| = 0$, or $\liminf_{k\to\infty}\|\nabla f(x_k)\| = 0$. If $\epsilon_i \ge \epsilon$ for all $i$, then
$$\min_{i \le k-1}\|\nabla f(x_i)\|^2 \le \frac{2L(f(x_0) - f^*)}{k\epsilon^2}.$$

Exercise 2.2. If $\alpha_k = -\theta\frac{\nabla f(x_k)^T d_k}{L\|d_k\|^2}$ and $\theta \in (0, 2)$, show that
$$f(x_k) - f(x_{k+1}) \ge \left(\theta - \frac{\theta^2}{2}\right)\frac{(\nabla f(x_k)^T d_k)^2}{L\|d_k\|^2}.$$

Remark 2.5. If $d_k = -D_k\nabla f(x_k)$ and $D_k$ is symmetric positive definite, then $\mathrm{cond}(D_k) \le \frac{1}{\epsilon} \implies \epsilon_k \ge \epsilon > 0$. The proof makes use of the fact that
$$\lambda_{\min}(D)\|u\|^2 \le u^T D u \le \lambda_{\max}(D)\|u\|^2, \qquad \|Du\| \le \lambda_{\max}(D)\|u\|,$$
and with $g = \nabla f(x)$, we have
$$\epsilon_k = -\frac{g_k^T d_k}{\|g_k\|\|d_k\|} = \frac{g_k^T D_k g_k}{\|g_k\|\|d_k\|} \ge \frac{\lambda_{\min}(D_k)\|g_k\|^2}{\|g_k\|\,\lambda_{\max}(D_k)\|g_k\|} = \frac{1}{\mathrm{cond}(D_k)} \ge \epsilon.$$

2.4 Inexact Line Search

Remark 2.6. Assume now that $L$ is not known or does not exist, and define $\phi_k(\alpha) = f(x_k + \alpha d_k) - f(x_k)$. We wish to choose $\alpha$ such that
$$\phi_k(\alpha) \le \sigma\phi_k'(0)\,\alpha \quad (*)$$
where $\sigma \in (0, 1)$ is a fixed constant, and where we wish $\alpha$ to not be close to $\bar\alpha$, a root of $\phi_k$, nor to $0$. To not be close to $0$, there are many strategies:
(a) Goldstein rule: for some constant $\tau \in (\sigma, 1)$, we require $\alpha_k$ to satisfy $\phi_k(\alpha) \ge \tau\phi_k'(0)\,\alpha$.
(b) Wolfe-Powell (W-P) rule: for some constant $\tau \in (\sigma, 1)$, we require $\alpha_k$ to satisfy $\phi_k'(\alpha) \ge \tau\phi_k'(0)$.

(c) Strong Wolfe-Powell rule: for some constant $\tau \in (\sigma, 1)$, we require $\alpha_k$ to satisfy $|\phi_k'(\alpha)| \le \tau|\phi_k'(0)|$ with $\sigma < 1/2$.
(d) Armijo's rule: let $s > 0$ and $\beta \in (0, 1)$ be fixed constants. Choose $\alpha_k$ as the largest scalar from $\{\alpha = s, s\beta, s\beta^2, \ldots\}$ such that (*) is satisfied.

Proposition 2.3. With respect to Armijo's rule:
1) $\exists \delta > 0$ such that (*) is satisfied strictly for any $\alpha \in (0, \delta)$.
2) If $\{\phi_k(\alpha) : \alpha > 0\}$ is bounded below, there exists an open interval of $\alpha$'s that satisfy rules (a) to (c).

Proof. Left as an exercise.

Theorem 2.1. Suppose that:
1) $f \in C^1(\mathbb{R}^n)$ and there exists $L > 0$ such that for all $y, z \in \{x : f(x) \le f(x_0)\}$ we have $\|\nabla f(y) - \nabla f(z)\| \le L\|y - z\|$;
2) $\{f(x_k)\}$ is bounded from below;
3) $\{d_k\}$ is gradient-related if $\alpha_k$ is chosen by Armijo's rule, i.e., there exists $\delta > 0$ such that $\|d_k\| \ge \delta\|\nabla f(x_k)\|$, $\forall k \ge 0$.
Then
$$\sum_{k=0}^\infty \epsilon_k^2\|\nabla f(x_k)\|^2 < \infty,$$
and hence if $\sum_{i=0}^\infty \epsilon_i^2 = \infty$ then $\liminf_{k\to\infty}\|\nabla f(x_k)\| = 0$. Thus, every accumulation point of $\{x_k\}$ is a stationary point.

Rates of Convergence

Consider the problem
$$\min_{x \in \mathbb{R}^n}\left\{ f(x) = \tfrac{1}{2}x^T Q x + c^T x + \gamma \right\}$$
where $Q \succ 0$ is symmetric.

Steepest Descent

The algorithm for our problem is
$$x_{k+1} = x_k - \alpha_k g_k, \qquad g_k = \nabla f(x_k), \qquad \alpha_k = \mathrm{argmin}_{\alpha \in \mathbb{R}} f(x_k - \alpha g_k) = \frac{\|g_k\|^2}{g_k^T Q g_k}.$$

Proposition 2.4. For every $k \ge 0$, we have
$$\frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} \le \left(\frac{M - m}{M + m}\right)^2 = \left(\frac{r - 1}{r + 1}\right)^2$$
where $m = \lambda_{\min}(Q)$, $M = \lambda_{\max}(Q)$ and $r = M/m = \mathrm{cond}(Q) \ge 1$.

Proof. (See the related proof for the projected gradient.)
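Returning briefly to Armijo's rule from Section 2.4, here is a minimal backtracking sketch of rule (d); the parameter defaults and the test function are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def armijo_step(f, grad_fx, x, d, s=1.0, beta=0.5, sigma=0.1, max_halvings=50):
    """Armijo's rule (d): try alpha = s, s*beta, s*beta^2, ... and accept the first
    alpha with f(x + alpha*d) - f(x) <= sigma * alpha * grad_fx^T d, where d is a
    descent direction (grad_fx^T d < 0 guarantees acceptance for small alpha)."""
    fx, slope = f(x), sigma * float(grad_fx @ d)
    alpha = s
    for _ in range(max_halvings):
        if f(x + alpha * d) - fx <= alpha * slope:
            return alpha
        alpha *= beta
    return alpha

# Illustrative usage with the steepest descent direction d = -grad f(x):
f = lambda x: float(x @ x)              # f(x) = ||x||^2
x = np.array([2.0, -1.0]); g = 2 * x
print(armijo_step(f, g, x, -g))          # returns 0.5 for these constants
```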

Gradient-type Methods

The algorithm for our problem is
$$x_{k+1} = x_k - \alpha_k D_k g_k \ \text{where}\ D_k \succ 0, \qquad \alpha_k = \mathrm{argmin}_{\alpha \in \mathbb{R}} f(x_k - \alpha D_k g_k).$$

Proposition 2.5. For every $k \ge 0$, we have
$$\frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} \le \left(\frac{M_k - m_k}{M_k + m_k}\right)^2 = \left(\frac{r_k - 1}{r_k + 1}\right)^2$$
where $M_k = \lambda_{\max}(D_k^{1/2} Q D_k^{1/2})$, $m_k = \lambda_{\min}(D_k^{1/2} Q D_k^{1/2})$, and $r_k = \mathrm{cond}(D_k^{1/2} Q D_k^{1/2})$.

Proof. We first note that
$$0 = \frac{d}{d\alpha}f(x_k + \alpha d_k) = \nabla f(x_k + \alpha d_k)^T d_k = [\nabla f(x_k) + \alpha_k Q d_k]^T d_k = \nabla f(x_k)^T d_k + \alpha_k d_k^T Q d_k$$
implies that $\alpha_k = -\frac{\nabla f(x_k)^T d_k}{d_k^T Q d_k}$. Next, if we define $\tilde f(y) = f(Sy)$ where $S = D_k^{1/2}$, then $\nabla\tilde f(y) = S\nabla f(Sy)$ and $\nabla^2\tilde f(y) = S\nabla^2 f(Sy)S = SQS$. For every $k$ let $y_k = S^{-1}x_k$ and note by our iteration scheme we have $\nabla\tilde f(y_k) = S\nabla f(x_k) = Sg_k$ as well as
$$Sy_{k+1} = Sy_k - \alpha_k S^2 g_k \implies y_{k+1} = y_k - \alpha_k\nabla\tilde f(y_k)$$
and
$$\alpha_k = \mathrm{argmin}_{\alpha \in \mathbb{R}} f(x_k - \alpha D_k g_k) = \mathrm{argmin}_{\alpha \in \mathbb{R}} \tilde f(y_k - \alpha Sg_k) = \mathrm{argmin}_{\alpha \in \mathbb{R}} \tilde f(y_k - \alpha\nabla\tilde f(y_k)).$$
So the transformed iteration is exact-line-search steepest descent on $\tilde f$, and from the previous proposition,
$$\frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} = \frac{\tilde f(y_{k+1}) - \tilde f^*}{\tilde f(y_k) - \tilde f^*} \le \left(\frac{M_k - m_k}{M_k + m_k}\right)^2 = \left(\frac{r_k - 1}{r_k + 1}\right)^2$$
where $M_k = \lambda_{\max}(D_k^{1/2} Q D_k^{1/2})$, $m_k = \lambda_{\min}(D_k^{1/2} Q D_k^{1/2})$, and $r_k = \mathrm{cond}(D_k^{1/2} Q D_k^{1/2})$.

Remark 2.7. If $r_k \to 1$ then
$$\lim_{k\to\infty}\frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} = 0.$$
For example, if $D_k \to Q^{-1}$, then the above holds.

2.5 Newton's Method

Consider a function $h : \mathbb{R}^n \to \mathbb{R}^n$ where $h \in C^1(\mathbb{R}^n)$. Newton's method finds a point $x \in \mathbb{R}^n$ where $h(x) = 0$. The idea, for a given $x_k$, uses
$$h(x) \approx h(x_k) + h'(x_k)(x - x_k) = 0 \implies x_{k+1} = x_k - h'(x_k)^{-1}h(x_k).$$
In the case of $h(x) = \nabla f(x) = 0$ where $h'(x) = \nabla^2 f(x)$, we have the iteration scheme
$$x_{k+1} = x_k - \nabla^2 f(x_k)^{-1}\nabla f(x_k).$$
In general optimization, we may use a second-order approximation to $f(x)$ and apply Newton's method to find where $\nabla f(x) = 0$.

Local Convergence of Newton's Method

Theorem 2.2. Assume $h \in C^2(\mathbb{R}^n)$ and let $x^* \in \mathbb{R}^n$ be such that $h(x^*) = 0$ and $h'(x^*)$ is non-singular. Then there exists $\eta > 0$ such that if $x_0 \in B(x^*; \eta)$ then $\{x_k\}$ obtained as $x_{k+1} = x_k - [h'(x_k)]^{-1}h(x_k)$ is well-defined and
$$\lim_{k\to\infty} x_k = x^*, \qquad \limsup_{k\to\infty}\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} < \infty.$$

Proof. Let $L := 2\|h'(x^*)^{-1}\|$ and choose $\eta > 0$ such that for all $x \in B(x^*; \eta)$ we have:
- $h'(x)^{-1}$ exists, $\|h'(x)^{-1}\| \le L$;
- $\eta L M < 1$ where $M = \sup_{x \in B(x^*;\eta)}\|h''(x)\|$.
It can be shown that $\|h'(x) - h'(y)\| \le M\|x - y\|$, $\forall x, y \in B(x^*; \eta)$. Then if $x_k \in B(x^*; \eta)$ we have
$$x_{k+1} - x^* = x_k - x^* - h'(x_k)^{-1}h(x_k) = h'(x_k)^{-1}\big[\underbrace{h(x^*)}_{=0} - h(x_k) - h'(x_k)(x^* - x_k)\big].$$
So
$$\|x_{k+1} - x^*\| \le \|h'(x_k)^{-1}\|\,\|h(x^*) - h(x_k) - h'(x_k)(x^* - x_k)\| \le L\left\|\int_0^1 [h'(x_k + t(x^* - x_k)) - h'(x_k)](x^* - x_k)\,dt\right\|$$
$$\le L\|x^* - x_k\|\int_0^1 \|h'(x_k + t(x^* - x_k)) - h'(x_k)\|\,dt \le L\|x^* - x_k\|\int_0^1 Mt\|x^* - x_k\|\,dt$$
$$= \frac{ML}{2}\|x^* - x_k\|^2 \le \frac{ML\eta}{2}\|x^* - x_k\| < \|x_k - x^*\|,$$
and hence $\lim_{k\to\infty}\|x_k - x^*\| = 0$.
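A minimal sketch of the Newton iteration $x_{k+1} = x_k - \nabla^2 f(x_k)^{-1}\nabla f(x_k)$ follows; the test function is an illustrative assumption chosen so the Hessian is non-singular at the minimizer, matching the hypotheses of Theorem 2.2.

```python
import numpy as np

def newton(grad, hess, x0, iters=20, tol=1e-10):
    """Newton's method for grad f(x) = 0: x_{k+1} = x_k - [hess f(x_k)]^{-1} grad f(x_k)."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # solve the linear system; never invert explicitly
    return x

# Illustrative problem: f(x) = x1^2 + exp(x2) - x2, minimizer (0, 0), Hessian diag(2, e^{x2}) > 0.
grad = lambda x: np.array([2 * x[0], np.exp(x[1]) - 1.0])
hess = lambda x: np.array([[2.0, 0.0], [0.0, np.exp(x[1])]])
print(newton(grad, hess, np.array([1.5, 1.0])))  # converges quadratically to (0, 0)
```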

2.6 Conjugate Gradient Method

Suppose we are dealing with the problem
$$\min_{x \in \mathbb{R}^n}\left\{ \tfrac{1}{2}x^T Q x - b^T x \right\}$$
where $Q \succ 0$ is symmetric. Let $\{d_0, \ldots, d_{n-1}\}$ be a basis for $\mathbb{R}^n$, $x_0 \in \mathbb{R}^n$, and denote $[d_0, \ldots, d_k]$ as the subspace spanned by $d_0, \ldots, d_k$. We use the notation $g_k = \nabla f(x_k)$.

Lemma 2.11. We have
$$x_{k+1} = \mathrm{argmin}\left\{ f(x) = \tfrac{1}{2}x^T Q x - b^T x : x \in x_0 + [d_0, \ldots, d_k] \right\}$$
if and only if
$$x_{k+1} = x_0 - D_k\left(D_k^T Q D_k\right)^{-1}D_k^T g_0$$
where $D_k = [d_0 \cdots d_k] \in \mathbb{R}^{n \times (k+1)}$ and $g_0 = \nabla f(x_0) = Qx_0 - b$. Also, $g_{k+1} \perp d_i$ for $i = 0, \ldots, k$.

Proof. We know that $x \in x_0 + [d_0, \ldots, d_k] \iff x = x_0 + D_k z$ for some $z \in \mathbb{R}^{k+1}$. So $x_{k+1} = x_0 + D_k z_{k+1}$ where $z_{k+1} = \mathrm{argmin}_z f(x_0 + D_k z) =: h(z)$. In particular, $z_{k+1}$ solves
$$0 = \nabla h(z) = D_k^T\nabla f(x_0 + D_k z) = D_k^T[Q(x_0 + D_k z) - b] = (D_k^T Q D_k)z + D_k^T g_0.$$
So $z_{k+1} = -(D_k^T Q D_k)^{-1}D_k^T g_0$, and the result follows after re-arranging terms and remarking that $0 = D_k^T\nabla f(x_0 + D_k z_{k+1}) = D_k^T g_{k+1}$.

Definition 2.9. A set of directions $\{d_0, \ldots, d_k\} \subseteq \mathbb{R}^n$ is Q-conjugate if $d_i^T Q d_j = 0$ for every $0 \le i < j \le k$. Equivalently, $D_k^T Q D_k$ is diagonal.

Proposition 2.6. Suppose that $Q \succ 0$ and $d_0, \ldots, d_k$ are nonzero Q-conjugate vectors. Then $d_0, \ldots, d_k$ are linearly independent.

Proof. Exercise.

Theorem 2.3. (Expanding Subspace Minimization) Assume that $x_{k+1} = \mathrm{argmin}\{f(x) : x \in x_0 + [d_0, \ldots, d_k]\}$ and that $d_0, \ldots, d_{k-1}$ are Q-conjugate. Then:
(a) $x_n = x^*$;
(b) $g_{k+1}^T d_i = 0$ for $i = 0, \ldots, k$;
(c) $x_{k+1} = x_k + \alpha_k d_k$ where $\alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k}$, or equivalently $\alpha_k = \mathrm{argmin}_\alpha f(x_k + \alpha d_k)$, or equivalently $x_{k+1} = \mathrm{argmin}\{f(x) : x \in x_k + [d_k]\}$.

Proof. (a) and (b) are obvious. For (c), note that $x_k \in x_0 + [d_0, \ldots, d_{k-1}] \subseteq x_0 + [d_0, \ldots, d_k]$, and so $x_k + [d_0, \ldots, d_k] = x_0 + [d_0, \ldots, d_k]$.

In the previous algorithms, we can hence replace $x_0$ with $x_k$. In particular, the first lemma can be replaced with the iteration scheme
$$x_{k+1} = x_k - D_k\left(D_k^T Q D_k\right)^{-1}D_k^T g_k.$$
Simplifying with the facts
$$D_k^T g_k = (g_k^T d_k)e_{k+1}, \qquad D_k^T Q D_k = \mathrm{diag}(d_0^T Q d_0, \ldots, d_k^T Q d_k), \qquad (D_k^T Q D_k)^{-1}D_k^T g_k = \frac{g_k^T d_k}{d_k^T Q d_k}e_{k+1},$$
where $e_{k+1}$ is the $(k+1)$-st standard basis vector in $\mathbb{R}^{k+1}$, this then reduces the iteration scheme further to
$$x_{k+1} = x_k - \left(\frac{g_k^T d_k}{d_k^T Q d_k}\right)D_k e_{k+1} = x_k - \left(\frac{g_k^T d_k}{d_k^T Q d_k}\right)d_k = x_k + \alpha_k d_k.$$

Algorithm 1. (Conjugate Gradient Method [sketch]) Given $x_0 \in \mathbb{R}^n$, let $d_0 = -g_0 = b - Qx_0$. For $k = 0, 1, 2, \ldots$ do:
$$x_{k+1} = x_k + \alpha_k d_k \quad \text{where}\ \alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k}.$$
If $g_{k+1} = 0$, stop; else
$$d_{k+1} = -g_{k+1} + \beta_k d_k \quad \text{where}\ \beta_k = \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k}.$$

Remark 2.8. Observe that
$$0 = d_{k+1}^T Q d_k = (-g_{k+1} + \beta_k d_k)^T Q d_k = -g_{k+1}^T Q d_k + \beta_k d_k^T Q d_k \implies \beta_k = \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k}.$$

Lemma 2.12. (Gram-Schmidt) Assume that $d_0, \ldots, d_{k-1}$ are Q-conjugate nonzero vectors and $p_k \notin [d_0, \ldots, d_{k-1}]$. Define
$$d_k = p_k - \sum_{i=0}^{k-1}\frac{p_k^T Q d_i}{d_i^T Q d_i}d_i = p_k + \sum_{i=0}^{k-1}\beta_{k-1,i}d_i \quad \text{where}\ \beta_{k-1,i} = -\frac{p_k^T Q d_i}{d_i^T Q d_i}.$$
Then $d_0, \ldots, d_k$ are Q-conjugate nonzero vectors and $[d_0, \ldots, d_k] = [d_0, \ldots, d_{k-1}, p_k]$.

Proof. Exercise.

Algorithm 2. (Alternate Conjugate Gradient) For $x_0 \in \mathbb{R}^n$, $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, $Q \succ 0$ symmetric, let $d_0 = -g_0 = b - Qx_0$. For $k = 0, 1, 2, \ldots$ do:
$$x_{k+1} = x_k + \alpha_k d_k \quad \text{where}\ \alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k}.$$
If $g_{k+1} = 0$, stop; else $d_{k+1} = -g_{k+1} + \sum_{i=0}^k \beta_{ki}d_i$ where $\beta_{ki} = \frac{g_{k+1}^T Q d_i}{d_i^T Q d_i}$. Here, we are generating the $d_k$ vectors on the fly, and by adapting Gram-Schmidt we have the added bonus that we are preserving Q-conjugacy.

Lemma 2.13. If $d_0, \ldots, d_k$ are Q-conjugate and $g_{k+1} \notin [d_0, \ldots, d_k]$ then $d_{k+1}$ as above satisfies:
(1) $d_{k+1}$ is Q-conjugate w.r.t. $\{d_0, \ldots, d_k\}$;
(2) $[d_0, \ldots, d_{k+1}] = [d_0, \ldots, d_k, g_{k+1}]$.

Theorem 2.4. Assume that $g_i \ne 0$, $i \in \{0, \ldots, k\}$. Then for all $i \in \{0, 1, \ldots, k\}$ we have:
(i) $d_0, \ldots, d_i$ are Q-conjugate;
(ii) $g_0, \ldots, g_i$ are orthogonal;

(iii) $[d_0, \ldots, d_i] = [g_0, \ldots, g_i]$;
(iv) $[d_0, \ldots, d_i] = [g_0, Qg_0, \ldots, Q^i g_0]$;
(v) $\alpha_i = \|g_i\|^2/(d_i^T Q d_i)$ and $-g_i^T d_i = \|g_i\|^2$.

Proof. By induction on $i$. For $i = 0$, it is obvious. Assume it is true for $i - 1$. Hence:
(a) $[d_0, \ldots, d_{i-1}] = [g_0, \ldots, g_{i-1}] = [g_0, Qg_0, \ldots, Q^{i-1}g_0] =: \mathcal{L}_{i-1}$;
(b) $g_0, \ldots, g_{i-1}$ are orthogonal;
(c) $d_0, \ldots, d_{i-1}$ are Q-conjugate.
By our previous lemma, $d_i$ is Q-conjugate w.r.t. $\{d_0, \ldots, d_{i-1}\}$ and so (i) follows. Also by the lemma, we know
$$[d_0, \ldots, d_i] = [d_0, \ldots, d_{i-1}, g_i] = [g_0, \ldots, g_{i-1}, g_i]$$
from (a), which shows (iii). Next, we have $d_i \in [d_0, \ldots, d_{i-1}, g_i] = [g_0, \ldots, Q^{i-1}g_0, g_i]$ from (a). Also $g_i = g_{i-1} + \alpha_{i-1}Qd_{i-1}$ with $g_{i-1} \in \mathcal{L}_{i-1}$ and $Qd_{i-1} \in Q\mathcal{L}_{i-1} \subseteq \mathcal{L}_i$, so $g_i \in \mathcal{L}_i$. This tells us then that $d_i \in \mathcal{L}_i$, so $[d_0, \ldots, d_i] \subseteq \mathcal{L}_i$. Since $d_0, \ldots, d_i$ are linearly independent, then $[d_0, \ldots, d_i] = \mathcal{L}_i$ and (iv) follows. Now we have $g_i \perp [d_0, \ldots, d_{i-1}]$ since the method is a Q-conjugate direction method. Since $[d_0, \ldots, d_{i-1}] = [g_0, \ldots, g_{i-1}]$, then $g_i \perp [g_0, \ldots, g_{i-1}]$ and (ii) follows. For (v), note that $d_i = -g_i + u$ with $u \in \mathcal{L}_{i-1}$, and hence $-g_i^T d_i = \|g_i\|^2 - \underbrace{u^T g_i}_{=0}$, and the definition of $\alpha_i$ follows.

Proposition 2.7. Assume that $g_{k+1} \ne 0$. Then
$$\beta_{ki} = \begin{cases} \|g_{k+1}\|^2/\|g_k\|^2 & i = k, \\ 0 & 0 \le i < k. \end{cases}$$

Proof. By definition $\beta_{ki} = \frac{g_{k+1}^T Q d_i}{d_i^T Q d_i}$. Next,
$$Qd_i = Q\left(\frac{x_{i+1} - x_i}{\alpha_i}\right) = \frac{g_{i+1} - g_i}{\alpha_i},$$
so
$$g_{k+1}^T Q d_i = g_{k+1}^T\left(\frac{g_{i+1} - g_i}{\alpha_i}\right) = \begin{cases} \|g_{k+1}\|^2/\alpha_k & i = k, \\ 0 & 0 \le i < k, \end{cases} \qquad d_i^T Q d_i = d_i^T\left(\frac{g_{i+1} - g_i}{\alpha_i}\right) = -\frac{d_i^T g_i}{\alpha_i} = \frac{\|g_i\|^2}{\alpha_i},$$
and the result follows.

Convergence Rate of the Conjugate Gradient Method

Note that
$$x \in x_0 + [d_0, \ldots, d_{k-1}] \iff x \in x_0 + [g_0, \ldots, Q^{k-1}g_0] \iff x = x_0 + \gamma_1 g_0 + \cdots + \gamma_k Q^{k-1}g_0\ \text{for some}\ \gamma \in \mathbb{R}^k.$$
Now, $g_0 = Q(x_0 - x^*)$ and hence
$$x - x^* = x_0 - x^* + \gamma_1 Q(x_0 - x^*) + \cdots + \gamma_k Q^k(x_0 - x^*) = (I + \gamma_1 Q + \cdots + \gamma_k Q^k)(x_0 - x^*) = P_k(Q)(x_0 - x^*)$$
where $P_k \in \bar{\mathcal{P}}_k$ and $\bar{\mathcal{P}}_k$ is the set of polynomials of degree at most $k$ such that $P_k(0) = 1$. Now we have $f(x) = f(x^*) + \tfrac{1}{2}(x - x^*)^T Q(x - x^*)$, so
$$2(f(x) - f(x^*)) = \|Q^{1/2}(x - x^*)\|^2$$

and the original QP is equivalent to
$$2(f(x_k) - f(x^*)) = \min\left\{\|Q^{1/2}(x - x^*)\|^2 : x \in x_0 + [d_0, \ldots, d_{k-1}]\right\}$$
$$= \min\left\{\|Q^{1/2}P_k(Q)(x_0 - x^*)\|^2 : P_k \in \bar{\mathcal{P}}_k\right\} = \min\left\{\|P_k(Q)Q^{1/2}(x_0 - x^*)\|^2 : P_k \in \bar{\mathcal{P}}_k\right\}$$
$$\le \left(\min\left\{\|P_k(Q)\|^2 : P_k \in \bar{\mathcal{P}}_k\right\}\right)\|Q^{1/2}(x_0 - x^*)\|^2.$$

Proposition 2.8. For every $k \ge 0$, we have
$$\frac{f(x_k) - f^*}{f(x_0) - f^*} \le \min\left\{\|P_k(Q)\|^2 : P_k \in \bar{\mathcal{P}}_k\right\}, \qquad \text{and} \qquad \|P_k(Q)\| = \max_{\lambda \in \sigma(Q)}|P_k(\lambda)|,$$
where $\sigma(Q)$ is the spectrum of $Q$, i.e. the set of eigenvalues of $Q$.

Corollary 2.1. For every $k \ge 0$ and $P_k \in \bar{\mathcal{P}}_k$, we have
$$\frac{f(x_k) - f^*}{f(x_0) - f^*} \le \max_{\lambda \in \sigma(Q)}|P_k(\lambda)|^2.$$

Corollary 2.2. Assume that $Q$ has $m < n$ distinct eigenvalues. Then $x_m = x^*$.

Proof. Let $\lambda_1, \ldots, \lambda_m$ be the distinct eigenvalues of $Q$. Let $P_m \in \bar{\mathcal{P}}_m$ be defined as
$$P_m(\lambda) = \frac{\prod_{i=1}^m(\lambda_i - \lambda)}{\prod_{i=1}^m\lambda_i},$$
and since $P_m(\lambda) = 0$ for every $\lambda \in \sigma(Q)$, the result follows from the previous proposition.

Rate of Convergence of CG Method

Corollary 2.3. Assume that $Q$ has (1) $n - m$ eigenvalues in $[a, b]$, $m > 0$, and (2) $m$ eigenvalues which are greater than $b$. Then
$$\frac{f(x_{m+1}) - f^*}{f(x_0) - f^*} \le \left(\frac{b - a}{a + b}\right)^2.$$
In particular, for $m = 0$ and $a = \lambda_{\min}$, $b = \lambda_{\max}$, we have
$$\frac{f(x_1) - f^*}{f(x_0) - f^*} \le \left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^2.$$

Proof. Let $\lambda_1, \ldots, \lambda_m$ denote the eigenvalues greater than $b$ and define $\lambda_{m+1} = \frac{b + a}{2}$. Next, define
$$P_{m+1}(\lambda) = \frac{\prod_{i=1}^{m+1}(\lambda_i - \lambda)}{\prod_{i=1}^{m+1}\lambda_i}$$

where clearly $P_{m+1} \in \bar{\mathcal{P}}_{m+1}$. By a previous proposition,
$$\frac{f(x_{m+1}) - f^*}{f(x_0) - f^*} \le \max_{\lambda \in [a,b]}|P_{m+1}(\lambda)|^2 \le \max_{\lambda \in [a,b]}\left|1 - \frac{2\lambda}{a + b}\right|^2 = \left(\frac{b - a}{a + b}\right)^2.$$

Corollary 2.4. For all $k \ge 0$, we have
$$\frac{f(x_k) - f^*}{f(x_0) - f^*} \le 4\left(\frac{\sqrt r - 1}{\sqrt r + 1}\right)^{2k}$$
where $r = M/m$ is the condition number of $Q$.

Proof. (Sketch) Use the Chebyshev polynomials
$$T_k(x) = \frac{1}{2}\left[\left(x + \sqrt{x^2 - 1}\right)^k + \left(x + \sqrt{x^2 - 1}\right)^{-k}\right] = \cos(k\arccos x)\ \text{for}\ x \in [-1, 1], \qquad |T_k(x)| \le 1,\ x \in [-1, 1],$$
and define
$$P_k(\lambda) = \frac{T_k\left(\frac{M + m - 2\lambda}{M - m}\right)}{T_k\left(\frac{M + m}{M - m}\right)} \in \bar{\mathcal{P}}_k.$$
Use a similar procedure as before to obtain the result.

2.7 General Conjugate Gradient Method

Definition 2.10. Consider the problem (*) $\min\{f(x) : x \in \mathbb{R}^n\}$ where $f \in C^1(\mathbb{R}^n)$. The CG framework, given $x_0 \in \mathbb{R}^n$, is: for $k = 0, 1, \ldots$ do
$$x_{k+1} = x_k + \alpha_k d_k, \qquad d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k,$$
where $\alpha_k > 0$ is the step size (e.g. use an exact or inexact line search method). Recall for the convex quadratic,
$$\beta_k = \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k} = \underbrace{\frac{\|g_{k+1}\|^2}{\|g_k\|^2}}_{(1)} = \underbrace{\frac{g_{k+1}^T(g_{k+1} - g_k)}{\|g_k\|^2}}_{(2)}.$$
Using (1) in the general case leads to the Fletcher-Reeves (FR) method while (2) leads to the Polak-Ribière (PR) method. Note that (with an exact line search, so $g_{k+1}^T d_k = 0$) the update implies $g_{k+1}^T d_{k+1} = -\|g_{k+1}\|^2 < 0$.

Theorem 2.5. (PR) Assume that $f$ is such that for $0 < m \le M$, $m\|u\|^2 \le u^T\nabla^2 f(x)u \le M\|u\|^2$ for all $x, u \in \mathbb{R}^n$. Then the PR-CG method with exact line search converges to the unique global minimum of (*).

Theorem 2.6. Assume that $f \in C^2(\mathbb{R}^n)$ and $\{x : f(x) \le f(x_0)\}$ is bounded. Then there exists an accumulation point $\bar x$ of $\{x_k\}$ such that $\nabla f(\bar x) = 0$. If $f$ is convex then $\{x_k\} \to x^*$. The Strong Wolfe-Powell inexact line search is used in this scheme, where $0 < \sigma < 1/2$, $\sigma < \tau < 1$ and
$$f(x_k + \alpha_k d_k) \le f(x_k) + \sigma\alpha_k\nabla f(x_k)^T d_k, \qquad |\nabla f(x_k + \alpha_k d_k)^T d_k| \le -\tau\nabla f(x_k)^T d_k.$$
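Here is a minimal runnable sketch of Algorithm 1 (the quadratic case); swapping the $\beta_k$ line for formula (1) or (2) with general gradients gives the FR/PR variants described above. The test matrix is illustrative data, not from the lecture.

```python
import numpy as np

def conjugate_gradient(Q, b, x0=None, tol=1e-10):
    """Algorithm 1 (CG) for min 0.5 x^T Q x - b^T x with Q symmetric positive
    definite, i.e. solving Qx = b; finite termination in at most n steps
    in exact arithmetic (Theorem 2.4 / Corollary 2.2)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    g = Q @ x - b          # g_k = gradient at x_k
    d = -g                 # d_0 = -g_0
    for _ in range(len(b)):
        if np.linalg.norm(g) < tol:
            break
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)    # alpha_k = -g_k^T d_k / d_k^T Q d_k
        x = x + alpha * d
        g = Q @ x - b
        beta = (g @ Qd) / (d @ Qd)     # beta_k = g_{k+1}^T Q d_k / d_k^T Q d_k
        d = -g + beta * d
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b), np.linalg.solve(Q, b))  # both give Q^{-1} b
```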

2.8 Nesterov's Method

Definition 2.11. Suppose that $f \in C^1(\mathbb{R}^n)$ is convex and $\nabla f(x)$ is $L$-Lipschitz, so that
$$\ell_f(\bar x; x) \le f(\bar x) \le \ell_f(\bar x; x) + \frac{L}{2}\|\bar x - x\|^2, \qquad \ell_f(\bar x; x) = f(x) + \nabla f(x)^T(\bar x - x).$$
For the problem $\min\{f(x) : x \in X\}$, let $X$ be closed and convex and let $X^*$ be the set of optimal solutions. Nesterov's method is as follows:
(0) Let $x_0 \in \mathbb{R}^n$ be given and set $y_0 = x_0$, $k = 0$, $A_0 = 0$.
(1) Compute
$$a_k = \frac{1 + \sqrt{1 + 4LA_k}}{2L}, \qquad A_{k+1} = A_k + a_k, \qquad \tilde x_k = \frac{A_k y_k + a_k x_k}{A_{k+1}},$$
$$y_{k+1} = \mathrm{argmin}_{x \in X}\left\{ \ell_f(x; \tilde x_k) + \frac{L}{2}\|x - \tilde x_k\|^2 \right\}, \qquad x_{k+1} = x_k + a_k L(y_{k+1} - \tilde x_k).$$
(2) Set $k \leftarrow k + 1$ and go to (1).

Proposition 2.9. There exists a sequence of affine functions $\{\gamma_k\}_{k \ge 0}$ such that $\gamma_k \le f$ and
$$(1)_k \quad A_k f(y_k) \le \min_x\left\{ A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2 \right\},$$
$$(2)_k \quad x_k = \mathrm{argmin}_{x \in \mathbb{R}^n}\left\{ A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2 \right\},$$
where $\Gamma_k(x) = \left(\sum_{i=0}^{k-1}a_i\gamma_i(x)\right)/A_k$ and $\gamma_i(x) = \ell_f(x; \tilde x_i)$.

[***Aside: It is important to know that if $f$ is $\mu$-strongly convex, then $f(x) \ge \min f + \frac{\mu}{2}\|x - x^*\|^2$. This will show up on the exam!]

Lemma 2.14. For every $k \ge 0$ we have
$$A_k = \sum_{i=0}^{k-1}a_i, \qquad A_{k+1}\Gamma_{k+1} = A_k\Gamma_k + a_k\gamma_k, \qquad \gamma_k \le f \implies A_k\Gamma_k \le A_k f.$$

Proof. Trivial.

Proof. [of previous proposition] We proceed by induction on $k$. The case $k = 0$ is obvious, so assume that $(1)_k$ and $(2)_k$ hold. In particular, using the previous lemma and the strong convexity of the minimand in $(2)_k$,
$$A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2 \ge A_k f(y_k) + \tfrac{1}{2}\|x - x_k\|^2 \quad (3)$$

and so for all $x \in X$ (*) we have, using the lemma again and letting $\bar x = \bar x(x) = \frac{A_k y_k + a_k x}{A_{k+1}}$, since $\bar x(x) - \tilde x_k = \frac{a_k}{A_{k+1}}(x - x_k)$:
$$A_{k+1}\Gamma_{k+1}(x) + \tfrac{1}{2}\|x - x_0\|^2 = A_k\Gamma_k(x) + a_k\gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2$$
$$\overset{(3)}{\ge} A_k f(y_k) + \tfrac{1}{2}\|x - x_k\|^2 + a_k\gamma_k(x) \ge A_k\gamma_k(y_k) + a_k\gamma_k(x) + \tfrac{1}{2}\|x - x_k\|^2$$
$$= A_{k+1}\gamma_k\left(\frac{A_k y_k + a_k x}{A_{k+1}}\right) + \tfrac{1}{2}\|x - x_k\|^2 \qquad (\gamma_k\ \text{is affine})$$
$$= A_{k+1}\gamma_k(\bar x) + \tfrac{1}{2}\left\|\frac{A_{k+1}}{a_k}(\bar x - \tilde x_k)\right\|^2 = A_{k+1}\left(\gamma_k(\bar x) + \frac{A_{k+1}}{2a_k^2}\|\bar x - \tilde x_k\|^2\right)$$
$$= A_{k+1}\left(\gamma_k(\bar x) + \frac{L}{2}\|\bar x - \tilde x_k\|^2\right) = A_{k+1}\left[\ell_f(\bar x; \tilde x_k) + \frac{L}{2}\|\bar x - \tilde x_k\|^2\right]$$
$$\ge A_{k+1}\left[\ell_f(y_{k+1}; \tilde x_k) + \frac{L}{2}\|y_{k+1} - \tilde x_k\|^2\right] \ge A_{k+1}f(y_{k+1}),$$
using $a_k^2 = A_{k+1}/L$. Hence $(1)_{k+1}$ follows. Next, for $(2)_{k+1}$, it is sufficient to show that
$$A_{k+1}\nabla\Gamma_{k+1} + x_{k+1} - x_0 = 0.$$
This is due to the construction of the algorithm:
$$A_{k+1}\nabla\Gamma_{k+1} = A_k\nabla\Gamma_k + a_k\nabla\gamma_k \overset{(2)_k}{=} x_0 - x_k + a_k\nabla\gamma_k = x_0 - x_{k+1},$$
since
$$y_{k+1} = \mathrm{argmin}_{x \in X}\left\{\gamma_k(x) + \frac{L}{2}\|x - \tilde x_k\|^2\right\} \implies \nabla\gamma_k + L(y_{k+1} - \tilde x_k) = 0 \implies \nabla\gamma_k = L(\tilde x_k - y_{k+1}).$$

Remark 2.9. For the constrained case, where we want the minimization in (*) to extend over $x \in \mathbb{R}^n$, take
$$\gamma_k(x) = \langle L(\tilde x_k - y_{k+1}), x - y_{k+1}\rangle + \ell_f(y_{k+1}; \tilde x_k),$$
which has the properties
$$\gamma_k(y_{k+1}) = \ell_f(y_{k+1}; \tilde x_k), \qquad \nabla\gamma_k = L(\tilde x_k - y_{k+1}),$$
$$\min_{x \in \mathbb{R}^n}\left\{\gamma_k(x) + \frac{L}{2}\|x - \tilde x_k\|^2\right\} = \min_{x \in X}\left\{\ell_f(x; \tilde x_k) + \frac{L}{2}\|x - \tilde x_k\|^2\right\}.$$
The proof can be constructed in the same manner as before.
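Here is a minimal sketch of Definition 2.11 for the unconstrained case $X = \mathbb{R}^n$, where $y_{k+1}$ reduces to a gradient step and $x_{k+1} = x_k - a_k\nabla f(\tilde x_k)$; the quadratic test problem is an illustrative assumption.

```python
import numpy as np

def nesterov(grad, x0, L, iters=100):
    """Nesterov's method (Definition 2.11), unconstrained case X = R^n.
    a_k solves a_k^2 = (A_k + a_k)/L, so A_k >= k^2/(4L) and
    f(y_k) - f* <= ||x0 - x*||^2 / (2 A_k) by Corollary 2.5."""
    lam = 1.0 / L
    x, y, A = x0.copy(), x0.copy(), 0.0
    for _ in range(iters):
        a = (lam + np.sqrt(lam**2 + 4 * lam * A)) / 2.0
        A_next = A + a
        xt = (A * y + a * x) / A_next    # xt_k = (A_k y_k + a_k x_k) / A_{k+1}
        g = grad(xt)
        y = xt - g / L                    # y_{k+1} = argmin of the prox subproblem
        x = x - a * g                     # x_{k+1} = x_k + a_k L (y_{k+1} - xt_k)
        A = A_next
    return y

# Illustrative problem: f(x) = 0.5 x^T Q x - b^T x with L = lambda_max(Q).
Q = np.array([[10.0, 0.0], [0.0, 1.0]]); b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(Q).max()
print(nesterov(lambda z: Q @ z - b, np.zeros(2), L), np.linalg.solve(Q, b))
```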

Corollary 2.5. For every $k \ge 0$ and $x^* \in X^*$ we have
$$f(y_k) - f^* \le \frac{1}{2A_k}\|x^* - x_0\|^2 = \frac{d_0^2}{2A_k}.$$
One can then show that $a_k \ge \frac{\lambda}{2} + \sqrt{\lambda A_k}$ for $\lambda = 1/L$, and hence
$$A_{k+1} = A_k + a_k \ge \left(\sqrt{A_k} + \frac{\sqrt\lambda}{2}\right)^2 \implies \sqrt{A_{k+1}} \ge \sqrt{A_k} + \frac{\sqrt\lambda}{2} \implies A_k \ge \frac{k^2\lambda}{4} = \frac{k^2}{4L}.$$

Proof. Since $\Gamma_k(x) \le f(x)$, then
$$A_k f(y_k) \le A_k f(x) + \tfrac{1}{2}\|x - x_0\|^2,\ \forall x \in X \implies A_k f(y_k) \le A_k f(x^*) + \tfrac{1}{2}\|x^* - x_0\|^2.$$

Strongly Convex Case

Suppose we start with the following two assumptions:
(A1) $f$ is differentiable on $X$ and for $L > 0$ we have $\|\nabla f(x) - \nabla f(\bar x)\| \le L\|x - \bar x\|$ for all $x, \bar x \in X$;
(A2) $f$ is $\mu$-strongly convex.
We then have that (A1), (A2) imply that for $x, \bar x \in X$,
$$\ell_f(\bar x; x) + \frac{\mu}{2}\|\bar x - x\|^2 \le f(\bar x) \le \ell_f(\bar x; x) + \frac{L}{2}\|\bar x - x\|^2.$$

Algorithm 3. The Nesterov algorithm for $\mu$-strongly convex functions under (A1), (A2) is:
(0) Let $x_0 \in \mathbb{R}^n$ be given and set $y_0 = x_0$, $k = 0$, $A_0 = 0$, $\frac{1}{L} \le \lambda \le \frac{1}{L - \mu}$.
(1) Compute
$$\lambda_k = (1 + \mu A_k)\lambda, \qquad a_k = \frac{\lambda_k + \sqrt{\lambda_k^2 + 4\lambda_k A_k}}{2}, \qquad A_{k+1} = A_k + a_k,$$
$$\tilde x_k = \frac{A_k y_k + a_k x_k}{A_{k+1}}, \qquad \hat x_k = \Pi_X(\tilde x_k),$$
$$y_{k+1} = \mathrm{argmin}_{x \in X}\left\{\ell_f(x; \hat x_k) + \frac{1}{2\lambda}\|x - \hat x_k\|^2 + \frac{\mu}{2}\|x - \hat x_k\|^2\right\},$$
$$x_{k+1} = x_k - \frac{a_k}{1 + \mu A_{k+1}}\left[\frac{\hat x_k - y_{k+1}}{\lambda} + \mu(x_k - y_{k+1})\right].$$
(2) Set $k \leftarrow k + 1$ and go to (1). Note that $a_k^2 = (A_k + a_k)\lambda_k = A_{k+1}\lambda_k$.

Proposition 2.10. Let $q(y)$ be a $\mu$-strongly convex function such that $q \le f$ on $X$. For $\lambda > 0$ and $\hat x \in \mathbb{R}^n$, define
$$\hat y = \mathrm{argmin}_{y \in X}\left\{q(y) + \frac{1}{2\lambda}\|y - \hat x\|^2\right\}.$$

Then the function
$$\gamma(y) = q(\hat y) + \left\langle\frac{\hat x - \hat y}{\lambda},\, y - \hat y\right\rangle + \frac{\mu}{2}\|y - \hat y\|^2$$
satisfies:
(a) $\gamma(\hat y) = q(\hat y)$;
(b) $\hat y = \mathrm{argmin}_{y \in \mathbb{R}^n}\left\{\gamma(y) + \frac{1}{2\lambda}\|y - \hat x\|^2\right\}$;
(c) $\gamma$ is $\mu$-strongly convex on $\mathbb{R}^n$;
(d) $\gamma \le q$ on $X$, which implies $\gamma \le f$ on $X$.

Aside (for the exam). If $\phi^* = \min\{\phi(x)\}$ and $\phi$ is $\beta$-strongly convex, with $x^* = \mathrm{argmin}_x \phi(x)$, then $\phi^* + \frac{\beta}{2}\|x - x^*\|^2 \le \phi(x)$.

Aside (for the exam). If $f$ is $\mu$-strongly convex, then $\lambda f + \tfrac{1}{2}\|\cdot - x_0\|^2$ is $(\lambda\mu + 1)$-strongly convex.

Proposition 2.11. For every $k \ge 0$ define
$$\Gamma_k(y) = \frac{\sum_{i=0}^{k-1}a_i\gamma_i(y)}{A_k},\ y \in \mathbb{R}^n \quad (1) \qquad \left(\implies A_k\Gamma_k = A_{k-1}\Gamma_{k-1} + a_{k-1}\gamma_{k-1}\right)$$
where
$$\gamma_k(y) = q_k(y_{k+1}) + \left\langle\frac{\hat x_k - y_{k+1}}{\lambda},\, y - y_{k+1}\right\rangle + \frac{\mu}{2}\|y - y_{k+1}\|^2, \qquad q_k(y) = \ell_f(y; \hat x_k) + \frac{\mu}{2}\|y - \hat x_k\|^2.$$
Then we have:
(a) $\Gamma_k$ is $\mu$-strongly convex;
(b) $\gamma_k \le q_k \le f$ on $X$;
(c) $\Gamma_k \le f$ on $X$;
(d) $x_k = \mathrm{argmin}_{x \in \mathbb{R}^n}\left\{A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2\right\}$;
(e) $A_k f(y_k) \le \min\left\{A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2\right\}$.

Proof. (a) Obvious. (b) Use the fact that $q_k \le f$ on $X$; $\gamma_k \le q_k$ on $X$ follows from the previous proposition. (c) $\Gamma_k \le f$ on $X$ follows from (1) and the fact that $\gamma_i \le f$ on $X$. (d) and (e): by induction on $k$. For $k = 0$, it is obvious since $A_0 = 0$. First, assume that $(d)_k$ and $(e)_k$ hold. Then for all $x \in \mathbb{R}^n$ we have
$$A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2 \ge A_k f(y_k) + \frac{A_k\mu + 1}{2}\|x - x_k\|^2.$$

So,
$$\min_{x \in \mathbb{R}^n}\left\{A_{k+1}\Gamma_{k+1}(x) + \tfrac12\|x - x_0\|^2\right\} = \min_{x \in \mathbb{R}^n}\left\{A_k\Gamma_k(x) + a_k\gamma_k(x) + \tfrac12\|x - x_0\|^2\right\}$$
$$\ge \min_{x \in \mathbb{R}^n}\left\{A_k f(y_k) + \frac{A_k\mu + 1}{2}\|x - x_k\|^2 + a_k\gamma_k(x)\right\} \ge \min_{x \in \mathbb{R}^n}\left\{A_k\gamma_k(y_k) + \frac{A_k\mu + 1}{2}\|x - x_k\|^2 + a_k\gamma_k(x)\right\}$$
$$\ge \min_{x \in \mathbb{R}^n}\left\{(A_k + a_k)\gamma_k\left(\frac{A_k y_k + a_k x}{A_k + a_k}\right) + \frac{A_k\mu + 1}{2}\|x - x_k\|^2\right\}$$
$$= \min_{\bar x \in \mathbb{R}^n}\left\{A_{k+1}\gamma_k(\bar x) + \frac{A_k\mu + 1}{2}\cdot\frac{A_{k+1}^2}{a_k^2}\|\bar x - \tilde x_k\|^2\right\} = A_{k+1}\min_{\bar x \in \mathbb{R}^n}\left\{\gamma_k(\bar x) + \frac{\lambda_k}{\lambda}\cdot\frac{1}{2\lambda_k}\|\bar x - \tilde x_k\|^2\right\}$$
$$= A_{k+1}\min_{\bar x \in \mathbb{R}^n}\left\{\gamma_k(\bar x) + \frac{1}{2\lambda}\|\bar x - \tilde x_k\|^2\right\},$$
using $a_k^2 = A_{k+1}\lambda_k$ and $\lambda_k = (1 + \mu A_k)\lambda$. Now
$$f(y_{k+1}) \le \ell_f(y_{k+1}; \hat x_k) + \frac{L}{2}\|y_{k+1} - \hat x_k\|^2 \le \ell_f(y_{k+1}; \hat x_k) + \frac{\mu}{2}\|y_{k+1} - \hat x_k\|^2 + \frac{L - \mu}{2}\|y_{k+1} - \hat x_k\|^2$$
$$\le q_k(y_{k+1}) + \frac{1}{2\lambda}\|y_{k+1} - \hat x_k\|^2 = \min_{y \in X}\left\{q_k(y) + \frac{1}{2\lambda}\|y - \hat x_k\|^2\right\} = \min_{y \in \mathbb{R}^n}\left\{\gamma_k(y) + \frac{1}{2\lambda}\|y - \hat x_k\|^2\right\},$$
and hence $\min_{x \in \mathbb{R}^n}\left\{A_{k+1}\Gamma_{k+1}(x) + \tfrac12\|x - x_0\|^2\right\} \ge A_{k+1}f(y_{k+1})$, which is $(e)_{k+1}$.

Let us prove that
$$(d)_{k+1} \quad A_{k+1}\nabla\Gamma_{k+1}(x_{k+1}) + x_{k+1} - x_0 = 0.$$
By $(d)_k$, $A_k\nabla\Gamma_k(x_k) + x_k - x_0 = 0$, and also
$$\nabla\Gamma_k(x) = \nabla\Gamma_k(\bar x) + \mu(x - \bar x), \quad \forall x, \bar x \in \mathbb{R}^n,\ k \ge 1. \quad (i)$$
So,
$$x_{k+1} - x_0 + A_{k+1}\nabla\Gamma_{k+1}(x_{k+1}) = x_{k+1} - x_0 + A_{k+1}[\nabla\Gamma_{k+1}(x_k) + \mu(x_{k+1} - x_k)]$$
$$= x_{k+1} - x_0 + A_k\nabla\Gamma_k(x_k) + a_k\nabla\gamma_k(x_k) + \mu A_{k+1}(x_{k+1} - x_k)$$
$$\overset{(i)}{=} (1 + \mu A_{k+1})(x_{k+1} - x_k) + a_k\left[\frac{\hat x_k - y_{k+1}}{\lambda} + \mu(x_k - y_{k+1})\right] = 0$$
by the update rule for $x_{k+1}$.

Corollary 2.6. For all $k \ge 1$ and $x^* \in X^*$ we have
$$f(y_k) - f^* \le \frac{1}{2A_k}\|x_0 - x^*\|^2.$$

Proof. (e) implies that
$$A_k f(y_k) \le A_k\Gamma_k(x^*) + \tfrac12\|x^* - x_0\|^2 \le A_k f(x^*) + \tfrac12\|x^* - x_0\|^2.$$

Proposition 2.12. For every $k \ge 1$ we have
$$A_k \ge \max\left\{\frac{k^2}{4L},\ \frac{1}{L}\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^{2(k-1)}\right\}.$$

Proof. Note that we have $a_k \ge \frac{\lambda_k}{2} + \sqrt{\lambda_k A_k}$ and hence
$$A_{k+1} = A_k + a_k \ge \left(\sqrt{A_k} + \frac{\sqrt{\lambda_k}}{2}\right)^2 = A_k + \sqrt{\lambda_k A_k} + \frac{\lambda_k}{4} \ge A_k + \sqrt{\mu\lambda}\,A_k + \frac{\mu\lambda}{4}A_k = A_k\left(1 + \frac{\sqrt{\mu\lambda}}{2}\right)^2 \ge A_k\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^2,$$
using $\lambda_k = (1 + \mu A_k)\lambda \ge \mu\lambda A_k$ and $\lambda \ge 1/L$, so
$$A_k \ge A_1\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^{2(k-1)} = \lambda\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^{2(k-1)} \ge \frac{1}{L}\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^{2(k-1)}.$$
The first part of the maximum is from the original Nesterov method.

2.9 Quasi-Newton Methods

Quasi-Newton Method's General Scheme
(0) Let $x_0 \in \mathbb{R}^n$ and $H_0 \in \mathbb{R}^{n \times n}$ symmetric with $H_0 \succ 0$ be given.
(1) For $k = 0, 1, 2, \ldots$ set
$$d_k = -H_k g_k, \qquad x_{k+1} = x_k + \alpha_k d_k.$$

Update $H_k$ to obtain $H_{k+1} \succ 0$ and symmetric. Here, we want $H_k \approx [\nabla^2 f(x_k)]^{-1}$.

Motivation. Let
$$q_k = g_{k+1} - g_k, \qquad p_k = x_{k+1} - x_k.$$
Then $q_k = \nabla^2 f(x_k)p_k + o(\|p_k\|)$, and if $f$ is quadratic then $q_k = \nabla^2 f(x_k)p_k$.

Secant Equation. $p_k = H_{k+1}q_k$, which comes from our above approximation.

Rank-One Updates (SR1). $H_{k+1} = H_k + a_k z_k z_k^T$ where $a_k \in \mathbb{R}$ and $z_k \in \mathbb{R}^n$. The secant equation implies
$$p_k = H_{k+1}q_k = H_k q_k + a_k(z_k^T q_k)z_k,$$
and so $z_k$ is proportional to $p_k - H_k q_k$. If we choose $z_k = p_k - H_k q_k$ then
$$1 = a_k(z_k^T q_k) = a_k(p_k - H_k q_k)^T q_k \implies a_k = \frac{1}{(p_k - H_k q_k)^T q_k}$$
and we are left with the update
$$H_{k+1} = H_k + \frac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{(p_k - H_k q_k)^T q_k}.$$

Rank-Two Updates. $H_{k+1} = H_k + a\,uu^T + b\,vv^T$ for $a, b \in \mathbb{R}$ and $u, v \in \mathbb{R}^n$. The secant equation implies that
$$p_k = H_{k+1}q_k = H_k q_k + a(u^T q_k)u + b(v^T q_k)v.$$
If we choose $u = p_k$ and $v = H_k q_k$ and enforce that
$$a(p_k^T q_k) = 1 \implies a = \frac{1}{p_k^T q_k}, \qquad b(q_k^T H_k q_k) = -1 \implies b = -\frac{1}{q_k^T H_k q_k},$$
then we have the Davidon-Fletcher-Powell (DFP) method with the update
$$H_{k+1}^{DFP} = H_k + \frac{p_k p_k^T}{p_k^T q_k} - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k}.$$

Lemma 2.15. For $c, d \in \mathbb{R}^n$, we have $\|c\|\|d\| \ge c^T d$ and equality holds if and only if $c, d$ are colinear.

Theorem 2.7. If $p_k^T q_k > 0$ for all $k \ge 0$, then every $H_k$ generated in the above way is positive definite and symmetric.

Proof. We proceed by induction on $k$. For $k = 0$, it is obvious by the $H_0 \succ 0$ assumption. Assume that $H_k \succ 0$ for some $k \ge 0$. Let $x \ne 0$ be given. Then
$$x^T H_{k+1}x = x^T H_k x + \frac{(p_k^T x)^2}{p_k^T q_k} - \frac{(q_k^T H_k x)^2}{q_k^T H_k q_k}.$$

Let $c = H_k^{1/2}x$ and $d = H_k^{1/2}q_k$. Then
$$x^T H_{k+1}x = \|c\|^2 - \frac{(c^T d)^2}{\|d\|^2} + \frac{(p_k^T x)^2}{p_k^T q_k} = \frac{\|c\|^2\|d\|^2 - (c^T d)^2}{\|d\|^2} + \frac{(p_k^T x)^2}{p_k^T q_k} \ge 0$$
from the previous lemma.

Claim. $x^T H_{k+1}x > 0$.

Proof. Assume for contradiction that $x^T H_{k+1}x = 0$. Then $p_k^T x = 0$ and $c, d$ are colinear, that is, $x = \lambda q_k$ for $\lambda \ne 0$. Hence $0 = p_k^T x = \lambda q_k^T p_k \ne 0$, a contradiction. So $H_{k+1} \succ 0$ as required.

Question 1. How can we guarantee the condition $p_k^T q_k > 0$ for $\alpha_k > 0$? Note
$$q_k^T p_k = (g_{k+1} - g_k)^T(\alpha_k d_k) = \alpha_k(g_{k+1}^T d_k - g_k^T d_k).$$

Solution. It is enough to enforce $g_{k+1}^T d_k > g_k^T d_k$. An example of such an inexact line search is the Wolfe-Powell line search with $0 < \sigma < \tau < 1$. In particular, it has the conditions:
(1) $f(x_k + \alpha_k d_k) \le f(x_k) + \alpha_k\sigma g_k^T d_k$;
(2) $g_{k+1}^T d_k \ge \tau g_k^T d_k > g_k^T d_k$.

Sherman-Morrison Formula

Proposition 2.13. Assume that $A = B + USV^T$ where $S \in \mathbb{R}^{m \times m}$, $A, B \in \mathbb{R}^{n \times n}$ are non-singular and $U, V \in \mathbb{R}^{n \times m}$. If $P = S^{-1} + V^T B^{-1}U$ is non-singular then
$$A^{-1} = B^{-1} - B^{-1}UP^{-1}V^T B^{-1}.$$

Other Rank-Two Updates

We could try the following iteration scheme
$$x_{k+1} = x_k - \alpha_k B_k^{-1}g_k, \qquad B_k \approx \nabla^2 f(x_k),$$
where $B_{k+1}$ is obtained from $B_k$ by the following rank-two formula (enforcing $B_{k+1}p_k = q_k$):
$$B_{k+1}^{BFGS} = B_k + \frac{q_k q_k^T}{q_k^T p_k} - \frac{B_k p_k p_k^T B_k}{p_k^T B_k p_k}.$$
We call this the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update. Using inversion, we can use the Sherman-Morrison formula to get
$$H_{k+1}^{BFGS} = (B_{k+1}^{BFGS})^{-1} = H_k + \left(1 + \frac{q_k^T H_k q_k}{p_k^T q_k}\right)\frac{p_k p_k^T}{p_k^T q_k} - \frac{p_k q_k^T H_k + H_k q_k p_k^T}{q_k^T p_k},$$
where $A = B_{k+1}^{BFGS}$, $B = B_k$, $U = [q_k,\ B_k p_k]$, $V = U$ and $S = \mathrm{diag}\left(\frac{1}{p_k^T q_k},\ -\frac{1}{p_k^T B_k p_k}\right)$.
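A minimal sketch of the inverse BFGS update above, with a numerical check of the secant equation $p_k = H_{k+1}q_k$; the test vectors are illustrative data chosen so that $p^T q > 0$.

```python
import numpy as np

def bfgs_inverse_update(H, p, q):
    """BFGS update of the inverse-Hessian approximation H (requires p^T q > 0):
    H+ = H + (1 + q^T H q / p^T q) * p p^T / (p^T q) - (p q^T H + H q p^T) / (p^T q)."""
    pq = float(p @ q)
    Hq = H @ q
    return (H
            + (1.0 + float(q @ Hq) / pq) * np.outer(p, p) / pq
            - (np.outer(p, Hq) + np.outer(Hq, p)) / pq)   # p (Hq)^T = p q^T H for symmetric H

# Illustrative check of the secant equation p = H_{k+1} q:
p = np.array([1.0, 2.0, 3.0]); q = np.array([0.9, 2.1, 2.8])   # p^T q = 13.5 > 0
H_next = bfgs_inverse_update(np.eye(3), p, q)
assert np.allclose(H_next @ q, p)
```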

Broyden's Family of Algorithms

Let $\phi = \phi_k \in \mathbb{R}$. Then the method is defined as
$$H_{k+1}^\phi = (1 - \phi)H_{k+1}^{DFP} + \phi H_{k+1}^{BFGS} = H_{k+1}^{DFP} + \phi\,v_k v_k^T$$
where
$$v_k = (q_k^T H_k q_k)^{1/2}\left(\frac{p_k}{p_k^T q_k} - \frac{H_k q_k}{q_k^T H_k q_k}\right).$$

Theorem 2.8. If $H_k \succ 0$, $p_k^T q_k > 0$, $\phi \ge 0$ then $H_{k+1}^\phi \succ 0$.

Theorem 2.9. If $f(x) = \tfrac12 x^T Q x - b^T x + c$ with $Q \succ 0$ then for every $k \ge 0$ such that $g_k \ne 0$ we have:
(1) $H_{k+1}^\phi q_j = p_j$ for $j = 0, 1, \ldots, k$;
(2) $p_j^T Q p_i = 0$ for $0 \le i < j \le k$;
(3) $p_0, \ldots, p_k$ are nonzero.
Hence, the method terminates in $m \le n$ iterations. If $m = n$ then $H_n = Q^{-1}$.

Remark 2.10. Since $q_j = Qp_j$, then $H_{k+1}q_j = p_j \implies (H_{k+1}Q)p_j = p_j$ for $j = 0, 1, \ldots, k$, and so $H_{k+1}Q$ acts like an identity operator on a particular subspace. In particular, $(H_{k+1}Q)x = x$ for all $x \in [p_0, \ldots, p_k]$.

Theorem. If $H_0 = I$ then the iterates generated by Broyden's Quasi-Newton method, with the exact line search method, are identical to those generated by the conjugate gradient method.

Convergence Result for General f

Theorem 2.10. Let $f : \mathbb{R}^n \to \mathbb{R}$, $f \in C^2(\mathbb{R}^n)$, and $x_0 \in \mathbb{R}^n$ be such that:
(1) $S = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is bounded and convex;
(2) $\nabla^2 f(x) \succ 0$ for all $x \in S$.
Let $\{x_k\}$ be a sequence generated by the Broyden Quasi-Newton method $x_{k+1} = x_k - \alpha_k H_k^{\phi_k}g_k$, where $\phi_k \in [0, 1]$, $H_0 = I$, and $\alpha_k$ is chosen by the W-P rule with $\alpha_k = 1$ the first attempted step size. Then $\lim_{k\to\infty} x_k = x^*$ superlinearly, in the sense that
$$\lim_{k\to\infty}\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0,$$
where $x^*$ is the unique global minimum of $f$ over $S$.

Limited Memory Quasi-Newton Methods

The general formula for a Quasi-Newton method is
$$\phi(H, p, q) = H + \left(1 + \frac{q^T H q}{p^T q}\right)\frac{pp^T}{p^T q} - \frac{pq^T H + Hqp^T}{p^T q} = \left(I - \frac{pq^T}{p^T q}\right)H\left(I - \frac{qp^T}{p^T q}\right) + \frac{pp^T}{p^T q},$$
and in particular $H_k^{BFGS} = \phi(H_{k-1}, p_{k-1}, q_{k-1})$. The idea for the limited-memory variant is that we store the latest pairs $(p_i, q_i)$ for $i = k-1, \ldots, k-m$ and generate $H_k$ recursively through the steps:
1. $H = H_k^0$ (simple, say $H = I$).
2. For $i = k-m, \ldots, k-1$ set $H \leftarrow \phi(H, p_i, q_i)$.
3. $H_k = H$.
It turns out this scheme makes the calculation of $H_k g_k$ very easy and the intermediate $H$ matrices simple as well. The following is a full description of the algorithm.

Algorithm 4. (For computing $H_k g_k$)
- $u \leftarrow g_k$
- for $i = k-1, \ldots, k-m$: $\ \alpha_i \leftarrow \frac{p_i^T u}{p_i^T q_i}$; $\ u \leftarrow u - \alpha_i q_i$
- $r \leftarrow H_k^0 u$
- for $i = k-m, \ldots, k-1$: $\ \beta \leftarrow \frac{q_i^T r}{p_i^T q_i}$; $\ r \leftarrow r + (\alpha_i - \beta)p_i$
- $H_k g_k \leftarrow r$
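A minimal sketch of Algorithm 4 (the two-loop recursion), which computes $H_k g_k$ without ever forming $H_k$; the test data at the bottom is an illustrative assumption.

```python
import numpy as np

def lbfgs_direction(g, pairs, H0_diag=1.0):
    """Algorithm 4: compute H_k g from the m most recent (p_i, q_i) pairs,
    supplied oldest-to-newest; H_k^0 is taken as H0_diag * I."""
    u = g.copy()
    alphas = []
    for p, q in reversed(pairs):                   # first loop: i = k-1, ..., k-m
        a = float(p @ u) / float(p @ q)
        alphas.append(a)
        u = u - a * q
    r = H0_diag * u                                 # r = H_k^0 u
    for (p, q), a in zip(pairs, reversed(alphas)):  # second loop: i = k-m, ..., k-1
        beta = float(q @ r) / float(p @ q)
        r = r + (a - beta) * p
    return r                                        # r = H_k g

# With a single stored pair and H_k^0 = I, this equals one BFGS update applied to g:
p = np.array([1.0, 2.0, 3.0]); q = np.array([0.9, 2.1, 2.8]); g = np.array([1.0, 0.0, -1.0])
print(lbfgs_direction(g, [(p, q)]))
```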

3 Constrained Optimization

The standard constrained optimization problem in this section will be denoted by
$$(ECP)\quad \min f(x) \quad \text{s.t.}\ h_i(x) = 0,\ i = 1, 2, \ldots, m, \qquad x \in \mathbb{R}^n,\ f, h_i \in C^2(\mathbb{R}^n).$$

Definition 3.1. We say that $x \in \mathbb{R}^n$ is a regular point of (ECP) if $\nabla h_1(x), \ldots, \nabla h_m(x)$ are linearly independent (equivalently, $\nabla h(x) = [\nabla h_1(x) \cdots \nabla h_m(x)]$ has full column rank).

Remark 3.1. If $x$ is a regular point, the matrix
$$\nabla h(x)^T\nabla h(x) \in \mathbb{R}^{m \times m}$$
is nonsingular.

Theorem 3.1. (Lagrange Multiplier Theorem - first-order necessary optimality conditions) If $x^*$ is a regular local minimum of (ECP), then there exists a unique (!) $\lambda^* \in \mathbb{R}^m$ such that
$$\nabla f(x^*) + \sum_{i=1}^m \lambda_i^*\nabla h_i(x^*) = 0.$$
More compactly, we have $\nabla f(x^*) + \nabla h(x^*)\lambda^* = 0$.

Proof. (Construction) There exists $\epsilon > 0$ such that
$$f(x) \ge f(x^*), \quad \forall x \in \bar B(x^*; \epsilon) =: S\ \text{s.t.}\ h(x) = 0. \quad (0)$$
Let $\alpha > 0$ be given and, for every $k \in \mathbb{N}$, let
$$x_k \in \mathrm{argmin}_{x \in S}\ F_k(x) := f(x) + \frac{k}{2}\|h(x)\|^2 + \frac{\alpha}{2}\|x - x^*\|^2,$$
where existence is guaranteed by the Weierstrass theorem.

Claim 3.1. $\lim_{k\to\infty} x_k = x^*$.

Proof. (Of claim) For all $k$ we have
$$F_k(x_k) \le F_k(x^*) \iff f(x_k) + \frac{k}{2}\|h(x_k)\|^2 + \frac{\alpha}{2}\|x_k - x^*\|^2 \le f(x^*). \quad (1)$$
Since $f(x)$ is bounded on $S$, we have that $\{f(x_k)\}$ is bounded. As $k \to \infty$, we have
$$\lim_{k\to\infty}h(x_k) = 0. \quad (2)$$
Let $\bar x$ be an accumulation point of $\{x_k\}$. By (2), we have $h(\bar x) = 0$, and by (1) we have
$$f(\bar x) + \frac{\alpha}{2}\|\bar x - x^*\|^2 \le f(x^*). \quad (3)$$
Since $\bar x \in S$ and $h(\bar x) = 0$, by (0), we have
$$f(\bar x) \ge f(x^*), \quad (4)$$
and by (3), (4), $\|\bar x - x^*\| = 0$.

(Theorem proof cont.) For all $k$ sufficiently large, $x_k \in \mathrm{int}(S)$ and hence $\nabla F_k(x_k) = 0$ and $\nabla^2 F_k(x_k) \succeq 0$. Now,
$$0 = \nabla F_k(x_k) = \nabla f(x_k) + k\,\nabla h(x_k)h(x_k) + \alpha(x_k - x^*) = \nabla f(x_k) + \nabla h(x_k)\lambda_k + \alpha(x_k - x^*)$$
where $\lambda_k = k\,h(x_k)$.

Claim 3.2. $\{\lambda_k\} \to \lambda^*$ for some $\lambda^* \in \mathbb{R}^m$.

Proof. We have
$$\nabla h(x_k)^T\nabla h(x_k)\lambda_k = -\nabla h(x_k)^T[\nabla f(x_k) + \alpha(x_k - x^*)]$$
$$\implies \lambda_k = -\left[\nabla h(x_k)^T\nabla h(x_k)\right]^{-1}\nabla h(x_k)^T[\nabla f(x_k) + \alpha(x_k - x^*)]$$
$$\implies \lim_{k\to\infty}\lambda_k = -\left[\nabla h(x^*)^T\nabla h(x^*)\right]^{-1}\nabla h(x^*)^T\nabla f(x^*) =: \lambda^*.$$

(Theorem proof cont.) Taking limits with the above results gives $\nabla f(x^*) + \nabla h(x^*)\lambda^* = 0$.

Theorem 3.2. (Second-Order Necessary Conditions) If $x^*$ is a regular local minimum of (ECP), then there exists a unique $\lambda^* \in \mathbb{R}^m$ such that $\nabla f(x^*) + \nabla h(x^*)\lambda^* = 0$ and
$$d^T\left(\nabla^2 f(x^*) + \sum_{i=1}^m \lambda_i^*\nabla^2 h_i(x^*)\right)d \ge 0 \quad \text{for all}\ d \in V(x^*),$$
where $V(x^*) = \{d \in \mathbb{R}^n : \nabla h(x^*)^T d = 0\}$.

Proof. Define
$$F_k(x) := f(x) + \frac{k}{2}\|h(x)\|^2 + \frac{\alpha}{2}\|x - x^*\|^2$$
and note that for all $k$ sufficiently large,
$$0 \preceq \nabla^2 F_k(x_k) = \nabla^2 f(x_k) + \sum_i \lambda_{k,i}\nabla^2 h_i(x_k) + k\,\nabla h(x_k)\nabla h(x_k)^T + \alpha I.$$
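As a quick illustration of Theorem 3.1 (a small worked example added for concreteness, not from the lecture): minimize $f(x) = x_1 + x_2$ subject to $h(x) = x_1^2 + x_2^2 - 2 = 0$. Every feasible point is regular since $\nabla h(x) = (2x_1, 2x_2)^T \ne 0$ on the circle. The condition $\nabla f(x^*) + \lambda^*\nabla h(x^*) = 0$ reads
$$1 + 2\lambda^* x_1^* = 0, \qquad 1 + 2\lambda^* x_2^* = 0,$$
so $x_1^* = x_2^* = -1/(2\lambda^*)$; feasibility then gives the two stationary points $x^* = (-1, -1)$ with $\lambda^* = \tfrac12$ (the minimizer) and $(1, 1)$ with $\lambda^* = -\tfrac12$ (the maximizer).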


More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

The Conjugate Gradient Algorithm

The Conjugate Gradient Algorithm Optimization over a Subspace Conjugate Direction Methods Conjugate Gradient Algorithm Non-Quadratic Conjugate Gradient Algorithm Optimization over a Subspace Consider the problem min f (x) subject to x

More information

Inequality Constraints

Inequality Constraints Chapter 2 Inequality Constraints 2.1 Optimality Conditions Early in multivariate calculus we learn the significance of differentiability in finding minimizers. In this section we begin our study of the

More information

Examination paper for TMA4180 Optimization I

Examination paper for TMA4180 Optimization I Department of Mathematical Sciences Examination paper for TMA4180 Optimization I Academic contact during examination: Phone: Examination date: 26th May 2016 Examination time (from to): 09:00 13:00 Permitted

More information

Lecture 18: November Review on Primal-dual interior-poit methods

Lecture 18: November Review on Primal-dual interior-poit methods 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Lecturer: Javier Pena Lecture 18: November 2 Scribes: Scribes: Yizhu Lin, Pan Liu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0 Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Numerical Optimization of Partial Differential Equations

Numerical Optimization of Partial Differential Equations Numerical Optimization of Partial Differential Equations Part I: basic optimization concepts in R n Bartosz Protas Department of Mathematics & Statistics McMaster University, Hamilton, Ontario, Canada

More information

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23 Optimization: Nonlinear Optimization without Constraints Nonlinear Optimization without Constraints 1 / 23 Nonlinear optimization without constraints Unconstrained minimization min x f(x) where f(x) is

More information

Notes on Numerical Optimization

Notes on Numerical Optimization Notes on Numerical Optimization University of Chicago, 2014 Viva Patel October 18, 2014 1 Contents Contents 2 List of Algorithms 4 I Fundamentals of Optimization 5 1 Overview of Numerical Optimization

More information

10 Numerical methods for constrained problems

10 Numerical methods for constrained problems 10 Numerical methods for constrained problems min s.t. f(x) h(x) = 0 (l), g(x) 0 (m), x X The algorithms can be roughly divided the following way: ˆ primal methods: find descent direction keeping inside

More information

1. Search Directions In this chapter we again focus on the unconstrained optimization problem. lim sup ν

1. Search Directions In this chapter we again focus on the unconstrained optimization problem. lim sup ν 1 Search Directions In this chapter we again focus on the unconstrained optimization problem P min f(x), x R n where f : R n R is assumed to be twice continuously differentiable, and consider the selection

More information

Gradient-Based Optimization

Gradient-Based Optimization Multidisciplinary Design Optimization 48 Chapter 3 Gradient-Based Optimization 3. Introduction In Chapter we described methods to minimize (or at least decrease) a function of one variable. While problems

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

NORMS ON SPACE OF MATRICES

NORMS ON SPACE OF MATRICES NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system

More information

Numerical Methods for Large-Scale Nonlinear Systems

Numerical Methods for Large-Scale Nonlinear Systems Numerical Methods for Large-Scale Nonlinear Systems Handouts by Ronald H.W. Hoppe following the monograph P. Deuflhard Newton Methods for Nonlinear Problems Springer, Berlin-Heidelberg-New York, 2004 Num.

More information

Lecture 10: September 26

Lecture 10: September 26 0-725: Optimization Fall 202 Lecture 0: September 26 Lecturer: Barnabas Poczos/Ryan Tibshirani Scribes: Yipei Wang, Zhiguang Huo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms: A Comparative Study

Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms: A Comparative Study International Journal of Mathematics And Its Applications Vol.2 No.4 (2014), pp.47-56. ISSN: 2347-1557(online) Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms:

More information

Static unconstrained optimization

Static unconstrained optimization Static unconstrained optimization 2 In unconstrained optimization an objective function is minimized without any additional restriction on the decision variables, i.e. min f(x) x X ad (2.) with X ad R

More information

University of Maryland at College Park. limited amount of computer memory, thereby allowing problems with a very large number

University of Maryland at College Park. limited amount of computer memory, thereby allowing problems with a very large number Limited-Memory Matrix Methods with Applications 1 Tamara Gibson Kolda 2 Applied Mathematics Program University of Maryland at College Park Abstract. The focus of this dissertation is on matrix decompositions

More information

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization E5295/5B5749 Convex optimization with engineering applications Lecture 8 Smooth convex unconstrained and equality-constrained minimization A. Forsgren, KTH 1 Lecture 8 Convex optimization 2006/2007 Unconstrained

More information

Lecture 1: Introduction. Outline. B9824 Foundations of Optimization. Fall Administrative matters. 2. Introduction. 3. Existence of optima

Lecture 1: Introduction. Outline. B9824 Foundations of Optimization. Fall Administrative matters. 2. Introduction. 3. Existence of optima B9824 Foundations of Optimization Lecture 1: Introduction Fall 2009 Copyright 2009 Ciamac Moallemi Outline 1. Administrative matters 2. Introduction 3. Existence of optima 4. Local theory of unconstrained

More information

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained

More information

Nonlinear equations. Norms for R n. Convergence orders for iterative methods

Nonlinear equations. Norms for R n. Convergence orders for iterative methods Nonlinear equations Norms for R n Assume that X is a vector space. A norm is a mapping X R with x such that for all x, y X, α R x = = x = αx = α x x + y x + y We define the following norms on the vector

More information

FIXED POINT ITERATIONS

FIXED POINT ITERATIONS FIXED POINT ITERATIONS MARKUS GRASMAIR 1. Fixed Point Iteration for Non-linear Equations Our goal is the solution of an equation (1) F (x) = 0, where F : R n R n is a continuous vector valued mapping in

More information

Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright.

Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright. Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright. John L. Weatherwax July 7, 2010 wax@alum.mit.edu 1 Chapter 5 (Conjugate Gradient Methods) Notes

More information

A Distributed Newton Method for Network Utility Maximization, II: Convergence

A Distributed Newton Method for Network Utility Maximization, II: Convergence A Distributed Newton Method for Network Utility Maximization, II: Convergence Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract The existing distributed algorithms for Network Utility

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION. Darin Griffin Mohr. An Abstract

HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION. Darin Griffin Mohr. An Abstract HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION by Darin Griffin Mohr An Abstract Of a thesis submitted in partial fulfillment of the requirements for the Doctor of

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

Search Directions for Unconstrained Optimization

Search Directions for Unconstrained Optimization 8 CHAPTER 8 Search Directions for Unconstrained Optimization In this chapter we study the choice of search directions used in our basic updating scheme x +1 = x + t d. for solving P min f(x). x R n All

More information

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss

More information

Penalty and Barrier Methods General classical constrained minimization problem minimize f(x) subject to g(x) 0 h(x) =0 Penalty methods are motivated by the desire to use unconstrained optimization techniques

More information

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH Jie Sun 1 Department of Decision Sciences National University of Singapore, Republic of Singapore Jiapu Zhang 2 Department of Mathematics

More information

Miscellaneous Nonlinear Programming Exercises

Miscellaneous Nonlinear Programming Exercises Miscellaneous Nonlinear Programming Exercises Henry Wolkowicz 2 08 21 University of Waterloo Department of Combinatorics & Optimization Waterloo, Ontario N2L 3G1, Canada Contents 1 Numerical Analysis Background

More information

Global Convergence of Perry-Shanno Memoryless Quasi-Newton-type Method. 1 Introduction

Global Convergence of Perry-Shanno Memoryless Quasi-Newton-type Method. 1 Introduction ISSN 1749-3889 (print), 1749-3897 (online) International Journal of Nonlinear Science Vol.11(2011) No.2,pp.153-158 Global Convergence of Perry-Shanno Memoryless Quasi-Newton-type Method Yigui Ou, Jun Zhang

More information

Optimization Methods. Lecture 19: Line Searches and Newton s Method

Optimization Methods. Lecture 19: Line Searches and Newton s Method 15.93 Optimization Methods Lecture 19: Line Searches and Newton s Method 1 Last Lecture Necessary Conditions for Optimality (identifies candidates) x local min f(x ) =, f(x ) PSD Slide 1 Sufficient Conditions

More information

Statistics 580 Optimization Methods

Statistics 580 Optimization Methods Statistics 580 Optimization Methods Introduction Let fx be a given real-valued function on R p. The general optimization problem is to find an x ɛ R p at which fx attain a maximum or a minimum. It is of

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Quasi Newton Methods Barnabás Póczos & Ryan Tibshirani Quasi Newton Methods 2 Outline Modified Newton Method Rank one correction of the inverse Rank two correction of the

More information

On nonlinear optimization since M.J.D. Powell

On nonlinear optimization since M.J.D. Powell On nonlinear optimization since 1959 1 M.J.D. Powell Abstract: This view of the development of algorithms for nonlinear optimization is based on the research that has been of particular interest to the

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Quasi-Newton Methods. Javier Peña Convex Optimization /36-725

Quasi-Newton Methods. Javier Peña Convex Optimization /36-725 Quasi-Newton Methods Javier Peña Convex Optimization 10-725/36-725 Last time: primal-dual interior-point methods Consider the problem min x subject to f(x) Ax = b h(x) 0 Assume f, h 1,..., h m are convex

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable Optimization II: Unconstrained Multivariable CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Justin Solomon CS 205A: Mathematical Methods Optimization II: Unconstrained Multivariable 1

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

EECS260 Optimization Lecture notes

EECS260 Optimization Lecture notes EECS260 Optimization Lecture notes Based on Numerical Optimization (Nocedal & Wright, Springer, 2nd ed., 2006) Miguel Á. Carreira-Perpiñán EECS, University of California, Merced May 2, 2010 1 Introduction

More information

MATH 4211/6211 Optimization Basics of Optimization Problems

MATH 4211/6211 Optimization Basics of Optimization Problems MATH 4211/6211 Optimization Basics of Optimization Problems Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0 A standard minimization

More information

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent Nonlinear Optimization Steepest Descent and Niclas Börlin Department of Computing Science Umeå University niclas.borlin@cs.umu.se A disadvantage with the Newton method is that the Hessian has to be derived

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Introduction to Nonlinear Stochastic Programming

Introduction to Nonlinear Stochastic Programming School of Mathematics T H E U N I V E R S I T Y O H F R G E D I N B U Introduction to Nonlinear Stochastic Programming Jacek Gondzio Email: J.Gondzio@ed.ac.uk URL: http://www.maths.ed.ac.uk/~gondzio SPS

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell DAMTP 2014/NA02 On fast trust region methods for quadratic models with linear constraints M.J.D. Powell Abstract: Quadratic models Q k (x), x R n, of the objective function F (x), x R n, are used by many

More information

Unconstrained optimization I Gradient-type methods

Unconstrained optimization I Gradient-type methods Unconstrained optimization I Gradient-type methods Antonio Frangioni Department of Computer Science University of Pisa www.di.unipi.it/~frangio frangio@di.unipi.it Computational Mathematics for Learning

More information

CONSTRAINED NONLINEAR PROGRAMMING

CONSTRAINED NONLINEAR PROGRAMMING 149 CONSTRAINED NONLINEAR PROGRAMMING We now turn to methods for general constrained nonlinear programming. These may be broadly classified into two categories: 1. TRANSFORMATION METHODS: In this approach

More information

David Hilbert was old and partly deaf in the nineteen thirties. Yet being a diligent

David Hilbert was old and partly deaf in the nineteen thirties. Yet being a diligent Chapter 5 ddddd dddddd dddddddd ddddddd dddddddd ddddddd Hilbert Space The Euclidean norm is special among all norms defined in R n for being induced by the Euclidean inner product (the dot product). A

More information

2.3 Linear Programming

2.3 Linear Programming 2.3 Linear Programming Linear Programming (LP) is the term used to define a wide range of optimization problems in which the objective function is linear in the unknown variables and the constraints are

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

Chapter 10 Conjugate Direction Methods

Chapter 10 Conjugate Direction Methods Chapter 10 Conjugate Direction Methods An Introduction to Optimization Spring, 2012 1 Wei-Ta Chu 2012/4/13 Introduction Conjugate direction methods can be viewed as being intermediate between the method

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information