ISyE 6663 (Winter 2017) Nonlinear Optimization


ISyE 6663 (Winter 2017) Nonlinear Optimization
Prof. R. D. C. Monteiro, Georgia Institute of Technology
LaTeXer: W. Kong
Last Revision: May 3, 2017

Table of Contents

1 Review of Concepts
  1.1 Unconstrained Optimization
  1.2 Convexity
  1.3 Projection onto Convex Sets
2 Algorithms
  2.1 Steepest Descent
  2.2 Projected Gradient Method
  2.3 Gradient-type Methods
  2.4 Inexact Line Search
  2.5 Newton's Method
  2.6 Conjugate Gradient Method
  2.7 General Conjugate Gradient Method
  2.8 Nesterov's Method
  2.9 Quasi-Newton Methods
3 Constrained Optimization
  3.1 General NLPs
  3.2 Augmented Lagrangian Methods
  3.3 Global Method
  3.4 Barrier Methods
  3.5 Interior Point Methods
  3.6 Interior Point Algorithm
4 Duality
  4.1 Dual Function
  4.2 Existence of G.M.s
  4.3 Augmented Lagrangian Method vs. Duality

These notes are currently a work in progress, and as such may be incomplete or contain errors.

ACKNOWLEDGMENTS: Special thanks to Michael Baker and his LaTeX-formatted notes. They were the inspiration for the structure of these notes.

Abstract: The purpose of these notes is to provide the reader with a secondary reference to the material covered in ISyE 6663.

1 Review of Concepts

1.1 Unconstrained Optimization

Definition 1.1. For a set $S \subseteq \mathbb{R}^n$ and $f : S \to \mathbb{R}$, an optimization problem can be formulated as
$$\min(\max)\ f(x) \quad \text{s.t.}\quad x \in S,$$
which we will call the standard minimization problem.

Example 1.1. Here are some basic examples:
(1) $S = \{x \in \mathbb{R}^n : Ax \le b\}$
(2) $S = \{x \in \mathbb{R}^n : h(x) = 0,\ g(x) \le 0,\ x \in X\}$ where $h : \mathbb{R}^n \to \mathbb{R}^m$, $g : \mathbb{R}^n \to \mathbb{R}^r$, and $X \subseteq \mathbb{R}^n$ is simple.

Remark 1.1. If $S = \mathbb{R}^n$ then the problem is unconstrained; otherwise, if $S \subsetneq \mathbb{R}^n$, it is constrained.

Example 1.2. (Least Squares) For errors $e_i = y_i - \hat f(t_i)$ and trial function $\hat f(t) = x_1 + x_2 \exp(-x_3 t)$, a constrained optimization problem is
$$\min\ \sum_{i=1}^m e_i^2 = \sum_{i=1}^m \left(y_i - x_1 - x_2 e^{-x_3 t_i}\right)^2 \quad \text{s.t.}\quad x_3 \ge 0,\ x_1 + x_2 = 1.$$

Definition 1.2. $x^* \in S$ is a [strict] global minimum (optimal solution) of the standard minimization problem if $f(x) \ge [>]\ f(x^*)$ [for $x \ne x^*$] for all $x \in S$. Similar definitions follow for maximization problems.

Notation. We will denote:
$$B(x^*; \varepsilon) = \{x \in \mathbb{R}^n : \|x - x^*\| < \varepsilon\}, \qquad \bar B(x^*; \varepsilon) = \{x \in \mathbb{R}^n : \|x - x^*\| \le \varepsilon\}.$$

Definition 1.3. $x^* \in S$ is a [strict] local minimum of the standard minimization problem if $\exists \varepsilon > 0$ such that $f(x) \ge [>]\ f(x^*)$ [for $x \ne x^*$] for all $x \in S \cap B(x^*, \varepsilon)$.

Definition 1.4. $S$ is compact iff $S$ is closed and bounded.

Theorem 1.1. (Weierstrass) If $S$ is compact and $f$ is continuous on $S$, then the standard minimization problem has a global minimum.

Corollary 1.1. If $S$ is closed, $f$ is continuous on $S$, and $\lim_{\|x\|\to\infty,\,x \in S} f(x) = \infty$, then the standard minimization problem has a global minimum. The condition $\lim_{\|x\|\to\infty,\,x \in S} f(x) = -\infty$ is instead required for maximization problems. Note that:
$$\lim_{\|x\|\to\infty,\,x\in S} f(x) = \infty \iff \left(\forall M \ge 0,\ \exists r \ge 0\ \text{s.t.}\ \|x\| > r,\ x \in S \implies f(x) > M\right)$$
$$\iff \left(\forall M \ge 0,\ \exists r \ge 0\ \text{s.t.}\ f(x) \le M,\ x \in S \implies \|x\| \le r\right) \iff \{x \in S : f(x) \le M\} \subseteq \bar B(0, r)$$
$$\iff \{x \in S : f(x) \le M\}\ \text{is bounded for every}\ M \ge 0.$$

Proof. (Sketch) Pick $x_0 \in S$, set $M = f(x_0)$, and remark that $\{x \in S : f(x) \le f(x_0)\}$ is compact. The rest follows from Weierstrass.

Definition 1.5. Given $S = \mathbb{R}^n$, $f : \mathbb{R}^n \to \mathbb{R}$, $\bar x \in \mathbb{R}^n$, the gradient of $f$ at $\bar x$ is
$$\nabla f(\bar x) = \left( \frac{\partial f}{\partial x_1}(\bar x), \ldots, \frac{\partial f}{\partial x_n}(\bar x) \right)^T \in \mathbb{R}^n.$$

Remark 1.2. (Interpretations) (1) On the level set $\{x \in \mathbb{R}^n : f(x) = f(\bar x)\}$, the gradient lies perpendicular to the set and points in the direction of steepest ascent. (2) The graph of $f$ is $\{(x, f(x)) \in \mathbb{R}^{n+1} : x \in \mathbb{R}^n\}$ and the gradient defines a linear approximation at $\bar x$ given by $t = f(\bar x) + \langle \nabla f(\bar x), x - \bar x \rangle$. In particular,
$$\begin{pmatrix} \nabla f(\bar x) \\ -1 \end{pmatrix}^T \begin{pmatrix} x - \bar x \\ t - f(\bar x) \end{pmatrix} = 0.$$

Proposition 1.1. If $x^*$ is a local minimum of the standard optimization problem and $f$ is differentiable at $x^*$, then $\nabla f(x^*) = 0$.

Proof. Let $d \in \mathbb{R}^n$ be given. For every $t > 0$ sufficiently small, $0 \le \frac{f(x^* + td) - f(x^*)}{t}$; as $t \to 0^+$ we get $\langle \nabla f(x^*), d \rangle = \nabla f(x^*)^T d \ge 0$ for any $d \in \mathbb{R}^n$. This is only the case when $\nabla f(x^*) = 0$, as taking $d = -\nabla f(x^*)$ gives $-\|\nabla f(x^*)\|^2 \ge 0$.

Definition 1.6. $H \in \mathbb{R}^{n \times n}$ is positive semi-definite if $x^T H x \ge 0$ for all $x \in \mathbb{R}^n$ (notation $H \succeq 0$). It is positive definite if $x^T H x > 0$ for all $x \in \mathbb{R}^n$, $x \ne 0$.

Fact 1.1. If $f$ is twice continuously differentiable at $x$, then
$$\nabla^2 f(x) = f''(x) = \left[ \frac{\partial^2 f}{\partial x_i \partial x_j}(x) \right]_{ij}$$
is symmetric.

Proposition 1.2. If $x^*$ is a local minimum of the standard optimization problem and $f$ is twice continuously differentiable at $x^*$ (or $f \in C^2(\mathbb{R}^n)$), then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq 0$.

Proof. Note that
$$f(x + h) = f(x) + \nabla f(x)^T h + \tfrac{1}{2} h^T \nabla^2 f(x) h + r(h)\|h\|^2, \qquad \lim_{h\to 0} r(h) = 0,$$
or equivalently $f(x + h) = f(x) + \nabla f(x)^T h + \tfrac{1}{2} h^T \nabla^2 f(x + th) h$ for some $t \in (0, 1)$. The case $\nabla f(x^*) = 0$ has already been shown, so let $H = \nabla^2 f(x^*)$ and $d \in \mathbb{R}^n$. We want to show that $d^T H d \ge 0$. For $t > 0$ sufficiently small, from the first expansion,
$$0 \le f(x^* + td) - f(x^*) = t \underbrace{\nabla f(x^*)^T d}_{=0} + \tfrac{1}{2} t^2 d^T H d + t^2 r(td)\|d\|^2.$$
Dividing by $t^2$ gives us $0 \le \tfrac{1}{2} d^T H d + r(td)\|d\|^2$. Taking $t \to 0$ yields $0 \le d^T H d$.

Example 1.3. The converse is generally not true. Consider the case $f(x) = x^3$, which satisfies the first- and second-order conditions at $x = 0$ but does not have a local minimum at that point.

Theorem 1.2. Assume that $f \in C^2$ and $x^* \in \mathbb{R}^n$ is such that $\nabla f(x^*) = 0$, $\nabla^2 f(x^*) \succ 0$. Then $x^*$ is a strict local minimizer of the standard minimization problem.

Proof. Let $H = \nabla^2 f(x^*)$. By the Weierstrass theorem, choose $\alpha > 0$ such that $u^T H u \ge \alpha$ for all $u \in \mathbb{R}^n$ with $\|u\| = 1$. We have
$$f(x^* + h) - f(x^*) = \tfrac{1}{2} h^T H h + r(h)\|h\|^2, \qquad \lim_{h\to 0} r(h) = 0,$$

which implies that $\exists \delta > 0$ such that $\|h\| \le \delta \implies |r(h)| \le \frac{\alpha}{4}$, and hence, if $\|h\| \le \delta$, we have
$$f(x^* + h) - f(x^*) = \|h\|^2 \left[ \tfrac{1}{2} \left( \tfrac{h}{\|h\|} \right)^T H \left( \tfrac{h}{\|h\|} \right) + r(h) \right] \ge \|h\|^2 \left[ \tfrac{\alpha}{2} - \tfrac{\alpha}{4} \right] = \tfrac{1}{4} \alpha \|h\|^2.$$
Hence, if $0 < \|h\| \le \delta$ then $f(x^* + h) - f(x^*) > 0$. So $x^*$ is a strict local minimum of the standard minimization problem.

Example 1.4. The above condition is not necessary and the converse is not true. Consider the function $f(x) = x^4$ at $x^* = 0$.

1.2 Convexity

Definition 1.7. $C \subseteq \mathbb{R}^n$ is a convex set if $(x, y) := \{tx + (1-t)y : t \in (0, 1)\} \subseteq C$ for all $x, y \in C$. Here are some properties:
1) If $\{C_i\}_{i \in I}$ is a collection of convex sets in $\mathbb{R}^n$ then $\cap_{i \in I} C_i$ is convex.
2) If $T : \mathbb{R}^n \to \mathbb{R}^m$ is affine and $C \subseteq \mathbb{R}^n$, $D \subseteq \mathbb{R}^m$ are convex, then $T(C)$, $T^{-1}(D)$ are convex.
3) $C_i \subseteq \mathbb{R}^{n_i}$ convex for $i = 1, 2, \ldots, r$ implies that $C_1 \times \cdots \times C_r$ is convex.
4) $C_i \subseteq \mathbb{R}^n$ convex for $i = 1, 2, \ldots, r$ implies that $C_1 + \cdots + C_r$ (Minkowski sum) is convex.
5) $C \subseteq \mathbb{R}^n$ convex, $\alpha \in \mathbb{R}$ implies that $\alpha C$ is convex.
6) $C$ convex implies that $\mathrm{cl}(C)$ and $\mathrm{int}(C)$ are convex.

Example 1.5. Here are some examples:
1) Hyperplane: for $0 \ne u \in \mathbb{R}^n$, $\beta \in \mathbb{R}$, define $H = H(u, \beta) := \{x \in \mathbb{R}^n : u^T x = \beta\}$.
2) Half-spaces: $H^+ = \{x \in \mathbb{R}^n : u^T x \ge \beta\}$, $H^- = \{x \in \mathbb{R}^n : u^T x \le \beta\}$.
3) Polyhedra: $\cap_i H_i^-$.

Proposition 1.3. If $C$ is convex then $\sum_{i=1}^n \alpha_i x_i \in C$ for $x_i \in C$, $\alpha_i \ge 0$, $i = 1, 2, \ldots, n$, and $\sum_{i=1}^n \alpha_i = 1$.

Proof. (Can be done by induction, using convexity.)

Definition 1.8. Let $C \subseteq \mathbb{R}^n$ be a convex set and $f(x)$ a function defined on $C$. A function $f$ is convex on $C$ if for all $x, y \in C$, $t \in (0, 1)$ we have
$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y).$$
The function $f$ is strictly convex if the above inequality holds strictly whenever $x \ne y$.

Definition 1.9. $f$ is $\beta$-strongly convex ($\beta > 0$) on $C$ if
$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y) - \frac{\beta}{2} t(1-t)\|x - y\|^2$$
for all $x, y \in C$ and $t \in (0, 1)$.

Proposition 1.4. $f$ is $\beta$-strongly convex iff $f - \frac{\beta}{2}\|\cdot\|^2$ is convex.

Proof. (Left as an exercise.)

Proposition 1.5. If $f$ is convex on $C$ then for every $\alpha \in \mathbb{R}$, the sets
$$\{x \in C : f(x) < \alpha\}, \qquad \{x \in C : f(x) \le \alpha\}$$
are convex.

Proposition 1.6. The following are equivalent:
(1) $f$ is convex on $C$;
(2) $\{(x, t) \in C \times \mathbb{R} : f(x) \le t\}$ is convex;
(3) $\{(x, t) \in C \times \mathbb{R} : f(x) < t\}$ is convex.

Proof. Left as an exercise to the reader.

Proposition 1.7. (Jensen's inequality) Assume $f$ is convex on $C$. If $x_1, \ldots, x_r \in C$ with $\sum_{i=1}^r \alpha_i = 1$, $\alpha_1, \ldots, \alpha_r \ge 0$, then $f\left( \sum_{i=1}^r \alpha_i x_i \right) \le \sum_{i=1}^r \alpha_i f(x_i)$.

Proof. Since $f$ is convex, the set $U = \{(x, t) \in C \times \mathbb{R} : f(x) \le t\}$ is convex. Clearly $(x_i, f(x_i))^T \in U$ for all $i = 1, \ldots, r$. So $\sum_i \alpha_i (x_i, f(x_i))^T = \left( \sum_i \alpha_i x_i, \sum_i \alpha_i f(x_i) \right)^T \in U$, and hence $f\left( \sum_{i=1}^r \alpha_i x_i \right) \le \sum_{i=1}^r \alpha_i f(x_i)$ from the definition of $U$.

Notation 1. For $\Omega \subseteq \mathbb{R}^n$ we write $f \in C^1(\Omega)$ if $f$ is continuously differentiable at every $x \in \Omega$.

Proposition 1.8. For $\Omega \subseteq \mathbb{R}^n$ convex and $f \in C^1(\Omega)$, the following are equivalent:
(a) $f$ is (strictly) convex on $\Omega$;
(b) $f(y) \ge (>)\ f(x) + \nabla f(x)^T (y - x)$, $\forall x, y \in \Omega$ $(x \ne y)$;
(c) $[\nabla f(y) - \nabla f(x)]^T (y - x) \ge (>)\ 0$, $\forall x, y \in \Omega$ $(x \ne y)$.

Proof. [(a) $\implies$ (b)] Let $x, y \in \Omega$ be given. For all $t \in (0, 1)$, we have
$$f(ty + (1-t)x) \le t f(y) + (1-t) f(x) \iff f(x + t(y - x)) \le f(x) + t[f(y) - f(x)]$$
$$\implies \frac{f(x + t(y - x)) - f(x)}{t} \le f(y) - f(x) \ \xrightarrow{\,t \to 0\,}\ \nabla f(x)^T (y - x) \le f(y) - f(x).$$

[(b) $\implies$ (a)] Let $x, y \in \Omega$ be given and $z_t = ty + (1-t)x \in \Omega$. Then by (b),
$$f(y) \ge f(z_t) + \nabla f(z_t)^T (y - z_t) \quad (1), \qquad f(x) \ge f(z_t) + \nabla f(z_t)^T (x - z_t) \quad (2),$$
and $t\cdot(1) + (1-t)\cdot(2)$ yields
$$t f(y) + (1-t) f(x) \ge f(z_t) + \nabla f(z_t)^T (ty + (1-t)x - z_t) = f(z_t) = f(ty + (1-t)x).$$

[(b) $\implies$ (c)] Just add the two inequalities:
$$f(y) \ge f(x) + \nabla f(x)^T (y - x), \qquad f(x) \ge f(y) + \nabla f(y)^T (x - y).$$

[(c) $\implies$ (b)] For some $t \in (0, 1)$, $f(y) - f(x) = \nabla f(z_t)^T (y - x)$

where $z_t = x + t(y - x)$. Since $z_t - x = t(y - x)$ we have
$$f(y) - f(x) - \nabla f(x)^T (y - x) = [\nabla f(z_t) - \nabla f(x)]^T (y - x) = \frac{1}{t}[\nabla f(z_t) - \nabla f(x)]^T (z_t - x) \ge 0.$$

Proposition 1.9. $f$ is $\beta$-strongly convex ($\beta > 0$) iff $f - \frac{\beta}{2}\|\cdot\|^2$ is convex.

Proposition 1.10. For $\Omega \subseteq \mathbb{R}^n$ convex, $f \in C^1(\Omega)$ and $\beta \in \mathbb{R}$, the following are equivalent:
(a) $f - \frac{\beta}{2}\|\cdot\|^2$ is convex;
(b) $\forall x, y \in \Omega$, $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2}\|y - x\|^2$;
(c) $\forall x, y \in \Omega$, $[\nabla f(y) - \nabla f(x)]^T (y - x) \ge \beta\|y - x\|^2$.

Proof. Define $\tilde f := f - \frac{\beta}{2}\|\cdot\|^2$ and remark that $\tilde f$ is convex and, from a previous result, $\nabla \tilde f(x) = \nabla f(x) - \beta x$,
$$\tilde f(y) \ge \tilde f(x) + \nabla \tilde f(x)^T (y - x), \ \forall x, y \in \Omega \quad (1), \qquad [\nabla \tilde f(y) - \nabla \tilde f(x)]^T (y - x) \ge 0, \ \forall x, y \in \Omega \quad (2),$$
and (1) is equivalent to (b) and (2) is equivalent to (c).

Proposition 1.11. For $\Omega \subseteq \mathbb{R}^n$ convex, $f \in C^1(\Omega)$ and $M \in \mathbb{R}$, the following are equivalent:
(a) $\frac{M}{2}\|\cdot\|^2 - f$ is convex;
(b) $\forall x, y \in \Omega$, $f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{M}{2}\|y - x\|^2$;
(c) $\forall x, y \in \Omega$, $[\nabla f(y) - \nabla f(x)]^T (y - x) \le M\|y - x\|^2$.

Proof. Apply the previous proposition with $\beta = -M$ and $f$ replaced by $-f$.

Proposition 1.12. Assume $\Omega \subseteq \mathbb{R}^n$ is convex and $f \in C^1(\Omega)$ is convex on $\Omega$. Then the following are equivalent for $\bar x \in \Omega$:
(a) $\bar x$ is a global minimum of $f$ on $\Omega$;
(b) $\bar x$ is a local minimum of $f$ on $\Omega$;
(c) $\nabla f(\bar x)^T (x - \bar x) \ge 0$, $\forall x \in \Omega$, i.e. $f'(\bar x; x - \bar x) \ge 0$ where $f'(\bar x; x - \bar x) = \nabla f(\bar x)^T (x - \bar x)$.

Proof. [(a) $\implies$ (b)] Obvious. [(b) $\implies$ (c)] Since $\bar x$ is a local minimum, $f(\bar x + t(x - \bar x)) - f(\bar x) \ge 0$ for $t > 0$ sufficiently small. If we divide by $t$ and take $t \to 0$ then $\nabla f(\bar x)^T (x - \bar x) \ge 0$. [(c) $\implies$ (a)] Let $x \in \Omega$ be given. By (c), $\nabla f(\bar x)^T (x - \bar x) \ge 0$. By the convexity of $f$, we have
$$f(x) \ge f(\bar x) + \nabla f(\bar x)^T (x - \bar x) \ge f(\bar x),$$
and so $\bar x$ is a global minimum.

Remark 1.3. If $\bar x \in \mathrm{int}(\Omega)$ then (c) $\iff \nabla f(\bar x) = 0$.

Proof. ($\implies$) Assume $\nabla f(\bar x) \ne 0$. We know $\exists \epsilon > 0$ such that $\bar B(\bar x; \epsilon) \subseteq \Omega$ and $\nabla f(\bar x)^T (x - \bar x) \ge 0$ for all $x \in \bar B(\bar x; \epsilon)$ from (c). Now $x := \bar x - \epsilon \frac{\nabla f(\bar x)}{\|\nabla f(\bar x)\|} \in \bar B(\bar x; \epsilon)$, and substituting this into the above inequality yields $0 \le -\epsilon\|\nabla f(\bar x)\| < 0$, leading to a contradiction.

Proposition 1.13. If $\Omega \subseteq \mathbb{R}^n$ is convex and $f \in C^1(\Omega)$ is strictly convex on $\Omega$, then $f$ has at most one global minimum.

Proof. Assume $\bar x \in \Omega$ is a global minimum of $\min\{f(x) : x \in \Omega\}$. Let $x \ne \bar x$, $x \in \Omega$ be given. We have $f(x) > f(\bar x) + \nabla f(\bar x)^T (x - \bar x)$ and $\nabla f(\bar x)^T (x - \bar x) \ge 0$ from a previous result. So $f(x) > f(\bar x)$, and thus $\bar x$ is the only global minimum.

Proposition 1.14. If $\Omega$ is convex, $f \in C^1(\Omega)$, and $\nabla f(\cdot)$ is $L$-Lipschitz continuous on $\Omega$ (i.e. $\|\nabla f(y) - \nabla f(x)\| \le L\|x - y\|$ for all $x, y \in \Omega$), then
$$-\frac{L}{2}\|x - y\|^2 \le f(y) - [f(x) + \nabla f(x)^T (y - x)] \le \frac{L}{2}\|x - y\|^2,$$
$$-L\|x - y\|^2 \le [\nabla f(y) - \nabla f(x)]^T (y - x) \le L\|x - y\|^2.$$
The second set of inequalities is proven by Cauchy-Schwarz.

Proposition 1.15. Assume $\Omega \subseteq \mathbb{R}^n$ is closed and convex, and $f \in C^1(\Omega)$ is $\beta$-strongly convex. Then $f^* = \inf\{f(x) : x \in \Omega\}$ has a unique optimal solution $x^*$ and
$$f(x) \ge f^* + \frac{\beta}{2}\|x - x^*\|^2, \quad \forall x \in \Omega.$$

Proof. Take $x_0 \in \Omega$. Since $f$ is $\beta$-strongly convex, we have $f(x) \ge f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{\beta}{2}\|x - x_0\|^2$ for all $x \in \Omega$. Hence, as $\|x\| \to \infty$, $x \in \Omega$, we will have $f(x) \to \infty$. Thus $\inf\{f(x) : x \in \Omega\}$ has a unique optimal solution $x^*$. Hence $\nabla f(x^*)^T (x - x^*) \ge 0$ for all $x \in \Omega$ and $f(x) \ge f^* + \frac{\beta}{2}\|x - x^*\|^2$, $\forall x \in \Omega$.

1.3 Projection onto Convex Sets

Definition 1.10. For $\Omega \subseteq \mathbb{R}^n$ closed and convex, $x \in \mathbb{R}^n$, we define
$$\Pi_\Omega(x) = \mathrm{argmin}_y \{\|y - x\| : y \in \Omega\} = \mathrm{argmin}_y \left\{ \tfrac{1}{2}\|y - x\|^2 : y \in \Omega \right\}$$
as the projection of $x$ onto $\Omega$. The latter definition is useful because the $\tfrac{1}{2}\|\cdot\|^2$ function is strongly convex.

Corollary 1.2. Using the previous definition and $\langle x, y \rangle = x^T y$:
(1) $\Pi_\Omega$ is well-defined;
(2) $\bar x = \Pi_\Omega(x) \iff \langle y - \bar x, x - \bar x \rangle \le 0$, $\forall y \in \Omega$;
(3) $\langle x_1 - x_2, \Pi_\Omega(x_1) - \Pi_\Omega(x_2) \rangle \ge \|\Pi_\Omega(x_1) - \Pi_\Omega(x_2)\|^2$ and hence $\|x_1 - x_2\| \ge \|\Pi_\Omega(x_1) - \Pi_\Omega(x_2)\|$ for all $x_1, x_2$. That is, $\Pi_\Omega$ is non-expansive. (The formal proof follows below.)
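Before the proof, here is a minimal numerical sketch of Definition 1.10 and the non-expansiveness in Corollary 1.2(3), using two sets whose projections have closed forms (a box and a Euclidean ball); the function names and test data are illustrative, not from the lecture.

```python
import numpy as np

def proj_box(x, lo, hi):
    """Euclidean projection onto the box {y : lo <= y <= hi} (componentwise clip)."""
    return np.clip(x, lo, hi)

def proj_ball(x, center, radius):
    """Euclidean projection onto the closed ball B(center; radius)."""
    d = x - center
    n = np.linalg.norm(d)
    return x.copy() if n <= radius else center + radius * d / n

# Non-expansiveness check: ||Pi(x1) - Pi(x2)|| <= ||x1 - x2||.
x1, x2 = np.array([2.0, 0.5, -3.0]), np.array([-1.0, 4.0, 0.2])
p1, p2 = proj_ball(x1, np.zeros(3), 1.0), proj_ball(x2, np.zeros(3), 1.0)
assert np.linalg.norm(p1 - p2) <= np.linalg.norm(x1 - x2) + 1e-12
```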

Proof. (1) is obvious. For (2), let $f(y) = \tfrac{1}{2}\|y - x\|^2$. Then
$$\bar x = \Pi_\Omega(x) \iff \bar x \in \mathrm{argmin}_y\{f(y) : y \in \Omega\} \iff \nabla f(\bar x)^T (y - \bar x) \ge 0, \ \forall y \in \Omega \iff (\bar x - x)^T (y - \bar x) \ge 0, \ \forall y \in \Omega.$$
For (3), define $\bar x_i = \Pi_\Omega(x_i)$, $i = 1, 2$. We have
$$(x_1 - \bar x_1)^T (\bar x_2 - \bar x_1) \le 0, \qquad (x_2 - \bar x_2)^T (\bar x_1 - \bar x_2) \le 0,$$
and adding the two above inequalities yields
$$[(x_1 - x_2) - (\bar x_1 - \bar x_2)]^T (\bar x_2 - \bar x_1) \le 0 \implies \|\bar x_1 - \bar x_2\|^2 \le (x_1 - x_2)^T (\bar x_1 - \bar x_2) \le \|x_1 - x_2\|\|\bar x_1 - \bar x_2\|.$$

Remark 1.4. If $\Omega$ is closed convex, $\bar x \in \Omega$, and we define the normal cone of $\bar x$ as $N_\Omega(\bar x) = \{n \in \mathbb{R}^n : n^T (y - \bar x) \le 0, \ \forall y \in \Omega\}$, then the second condition of the previous corollary says $0 \in \bar x - x + N_\Omega(\bar x) \iff \bar x = \Pi_\Omega(x)$.

Remark 1.5. If $f$ is convex [I'm assuming we need this], then the problem $\min_y \{f(y) : y \in \Omega\}$ is equivalent to $0 \in \nabla f(x^*) + N_\Omega(x^*)$. This follows from the fact that the optimality condition for the problem is $\nabla f(x^*)^T (y - x^*) \ge 0$, $\forall y \in \Omega \iff -\nabla f(x^*) \in N_\Omega(x^*)$.

Proposition 1.16. Assume $\Omega \subseteq \mathbb{R}^n$ is convex and $f \in C^2(\Omega)$. Then:
(a) $\nabla^2 f(x) \succeq 0$, $\forall x \in \Omega \implies f$ is convex on $\Omega$;
(b) $f$ is convex on $\Omega$ and $\mathrm{int}\,\Omega \ne \emptyset \implies \nabla^2 f(x) \succeq 0$, $\forall x \in \Omega$;
(c) $\nabla^2 f(x) \succ 0$, $\forall x \in \Omega \implies f$ is strictly convex on $\Omega$.

Proof. (a) Let $x, y \in \Omega$. We will show $f(y) \ge f(x) + \nabla f(x)^T (y - x)$. We have
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}(y - x)^T \nabla^2 f(\xi)(y - x)$$
for some $\xi = x + t(y - x)$ and $t \in (0, 1)$. Clearly $\xi \in \Omega$ and hence $\nabla^2 f(\xi) \succeq 0$. So $d^T \nabla^2 f(\xi) d \ge 0$, $\forall d \in \mathbb{R}^n$, and the result follows.
(b) By contradiction, assume $\exists x \in \Omega$ such that $\nabla^2 f(x) \not\succeq 0$. Without loss of generality, we may assume that $x \in \mathrm{int}\,\Omega$ from the fact that $\Omega \subseteq \mathrm{cl}(\mathrm{int}(\Omega))$. From our assumption, we know $\lambda_{\min}[\nabla^2 f(x)] < 0$ and $\exists d \in \mathbb{R}^n$ with $d^T \nabla^2 f(x) d < 0$. By continuity, $\exists \epsilon > 0$ such that $d^T \nabla^2 f(y) d < 0$, $\forall y \in B(x, \epsilon)$. Take $\bar x = x + \epsilon' d$ with $\epsilon' > 0$ small enough that $\bar x \in B(x, \epsilon)$. Then
$$f(\bar x) = f(x) + \nabla f(x)^T (\bar x - x) + \tfrac{1}{2}(\bar x - x)^T \nabla^2 f(y)(\bar x - x)$$
for $y = x + t(\bar x - x) \in B(x, \epsilon)$ and $t \in (0, 1)$, and hence $f(\bar x) < f(x) + \nabla f(x)^T (\bar x - x)$, contradicting convexity.
(c) Same as (a) except we use strictness.

Corollary 1.3. Assume $\Omega \subseteq \mathbb{R}^n$ is convex and $f \in C^2(\Omega)$. For $m, M \in \mathbb{R}$, we have
$$mI \preceq \nabla^2 f(x) \preceq MI \iff f(\cdot) - \tfrac{m}{2}\|\cdot\|^2 \ \text{and}\ \tfrac{M}{2}\|\cdot\|^2 - f(\cdot)\ \text{are convex}$$
$$\iff \frac{m}{2}\|y - x\|^2 \le f(y) - [f(x) + \nabla f(x)^T (y - x)] \le \frac{M}{2}\|y - x\|^2$$
$$\iff m\|y - x\|^2 \le [\nabla f(y) - \nabla f(x)]^T (y - x) \le M\|y - x\|^2.$$

2 Algorithms

Definition 2.1. $d \in \mathbb{R}^n$ is a descent direction at $x$ if $\exists \delta > 0$ such that $\forall t \in (0, \delta)$ we have $f(x + td) < f(x)$.

Lemma 2.1. If $\nabla f(x)^T d < 0$ then $d$ is a descent direction at $x$.

Example 2.1. We may select $d = -\nabla f(x)$ or $d = -D\nabla f(x)$ where $D \succ 0$, as long as $\nabla f(x) \ne 0$.

Definition 2.2. A line search method is an algorithm with an update of the form $x_{k+1} = x_k + \alpha_k d_k$, where $d_k$ is a descent direction at $x_k$ and $\alpha_k$ is a positive step size, e.g. $\alpha_k \overset{?}{=} \mathrm{argmin}_{t \in [0, \bar\alpha]}\{f(x_k + t d_k) : t > 0\}$.

Definition 2.3. The trust region method has the following principle: given $x_k \in \mathbb{R}^n$, we approximate $f(x_k + p) \approx m_k(p)$ where $m_k(p)$ is a simple function (e.g. $f(x_k) + \nabla f(x_k)^T p$) and solve $p_k = \mathrm{argmin}_{p \in T_k} \{m_k(p)\}$ (e.g. $T_k = \bar B(0, \delta_k)$). If $f(x_k + p_k)$ is close to $m_k(p_k)$ then we iterate $x_{k+1} = x_k + p_k$. Otherwise, we reject $x_k + p_k$, set $x_{k+1} = x_k$, and shrink $T_k$. Closeness can be defined with
$$\rho_k = \frac{m_k(0) - f(x_k + p_k)}{m_k(0) - m_k(p_k)} = \frac{f(x_k) - f(x_k + p_k)}{f(x_k) - m_k(p_k)},$$
where $\rho_k \approx 1$ implies that our estimate is close.

2.1 Steepest Descent

Definition 2.4. For a function $f \in C^1(\mathbb{R}^n)$ which has $L$-Lipschitz continuous gradient, the steepest descent with fixed step size method is: given $x_0 \in \mathbb{R}^n$ and $\theta \in (0, 2)$, update
$$x_k = x_{k-1} - \frac{\theta}{L}\nabla f(x_{k-1}), \qquad k \leftarrow k + 1.$$

Proposition 2.1. Assume that $f(x_k) \ge f^*$ in the steepest descent method. Then for all $k \ge 1$ we have
$$\min_{1 \le i \le k} \|\nabla f(x_{i-1})\| \le \left[ \frac{2L\,(f(x_0) - f^*)}{k\,\theta(2 - \theta)} \right]^{1/2}.$$

Proof. For all $i \ge 1$, using our update step, we have
$$f(x_i) - f(x_{i-1}) \le \nabla f(x_{i-1})^T (x_i - x_{i-1}) + \frac{L}{2}\|x_i - x_{i-1}\|^2 = -\frac{\theta}{L}\|\nabla f(x_{i-1})\|^2 + \frac{\theta^2}{2L}\|\nabla f(x_{i-1})\|^2 = -\frac{\theta}{L}\left(1 - \frac{\theta}{2}\right)\|\nabla f(x_{i-1})\|^2.$$
So $f(x_{i-1}) - f(x_i) \ge \frac{\theta(2-\theta)}{2L}\|\nabla f(x_{i-1})\|^2$, and summing for $i = 1, 2, \ldots, k$ we get
$$f(x_0) - f^* \ge f(x_0) - f(x_k) \ge \frac{\theta(2-\theta)}{2L}\sum_{i=1}^k \|\nabla f(x_{i-1})\|^2 \ge \frac{k\,\theta(2-\theta)}{2L}\min_{i=1,2,\ldots,k}\|\nabla f(x_{i-1})\|^2.$$
The result follows after a simple re-arrangement.

Definition 2.5. For $\Omega \subseteq \mathbb{R}^n$ convex and $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient on $\Omega$, the projected gradient method is: given $x_0 \in \mathbb{R}^n$ and $\theta \in (0, 2)$, update
$$x_k = \mathrm{argmin}_x\left\{ \ell_f(x; x_{k-1}) + \frac{L}{2\theta}\|x - x_{k-1}\|^2 : x \in \Omega \right\}, \qquad k \leftarrow k + 1,$$
where $\ell_f(x; x_{k-1}) = f(x_{k-1}) + \nabla f(x_{k-1})^T (x - x_{k-1})$.

Lemma 2.2. For all $k \ge 1$, under the projected gradient scheme, we have
$$0 \in \nabla f(x_{k-1}) + N_\Omega(x_k) + \frac{L}{\theta}(x_k - x_{k-1}).$$

Proof. Define $\varphi_k(x) = \ell_f(x; x_{k-1}) + \frac{L}{2\theta}\|x - x_{k-1}\|^2$. We first know that $\nabla_x \ell_f(x; x_{k-1}) = \nabla f(x_{k-1})$ and $\nabla_x\left[\frac{L}{2\theta}\|x - x_{k-1}\|^2\right] = \frac{L}{\theta}(x - x_{k-1})$, and $x_k$ is optimal iff $0 \in \nabla\varphi_k(x_k) + N_\Omega(x_k)$. Hence we must have $0 \in \nabla f(x_{k-1}) + N_\Omega(x_k) + \frac{L}{\theta}(x_k - x_{k-1})$.

Lemma 2.3. Let $\bar r_k = \frac{L}{\theta}(x_{k-1} - x_k)$ and $r_k = \bar r_k + \nabla f(x_k) - \nabla f(x_{k-1})$. Then $r_k \in \nabla f(x_k) + N_\Omega(x_k)$ and
$$\|r_k\| \le L\left(\frac{1}{\theta} + 1\right)\|x_k - x_{k-1}\|.$$

Proof. (Simple algebraic manipulation using the Lipschitz property.)

Remark 2.1. If we can show that $\liminf_{k\to\infty} \|r_k\| = 0$ then the optimality condition $0 \in \nabla f(\bar x) + N_\Omega(\bar x)$ is approached via $\{x_k\}$.

Lemma 2.4. We have
$$f(x_{k-1}) - f(x_k) \ge \frac{L}{\theta}\left(1 - \frac{\theta}{2}\right)\|x_k - x_{k-1}\|^2.$$

Proof. (Will be shown next class.)
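As a quick aside before the complexity bound, here is a minimal sketch of the fixed-step update from Definition 2.4 (the unconstrained case), assuming $L$ is known; the quadratic test problem at the bottom is illustrative data, not from the lecture.

```python
import numpy as np

def steepest_descent_fixed(grad, x0, L, theta=1.0, iters=100):
    """Steepest descent with fixed step theta/L, theta in (0, 2) (Definition 2.4)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - (theta / L) * grad(x)
    return x

# Illustrative problem: f(x) = 0.5 x^T Q x - b^T x, gradient Qx - b, L = lambda_max(Q).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(Q).max()
x = steepest_descent_fixed(lambda x: Q @ x - b, np.zeros(2), L)
print(x, np.linalg.solve(Q, b))  # the iterate approaches the true minimizer Q^{-1} b
```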

Proposition 2.2. Assume that $f(x_k) \ge f^*$ for all $k \ge 0$. Then for all $k \ge 1$ we have
$$\min_{1 \le i \le k}\|r_i\| \le \sqrt{\frac{f(x_0) - f^*}{k}}\,(\theta + 1)\left(\frac{2L}{\theta(2 - \theta)}\right)^{1/2}.$$

Proof. We have
$$f(x_0) - f^* \ge f(x_0) - f(x_k) = \sum_{i=1}^k (f(x_{i-1}) - f(x_i)) \ge \frac{L}{\theta}\left(1 - \frac{\theta}{2}\right)\sum_{i=1}^k \|x_i - x_{i-1}\|^2$$
$$\ge \frac{kL(2 - \theta)}{2\theta}\min_{1 \le i \le k}\|x_i - x_{i-1}\|^2 \ge \frac{kL(2 - \theta)}{2\theta}\cdot\frac{\theta^2}{L^2(1 + \theta)^2}\min_{1 \le i \le k}\|r_i\|^2.$$

2.2 Projected Gradient Method

Definition 2.6. For a set $\Omega \subseteq \mathbb{R}^n$ which is closed and convex and a function $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient, the linear approximation of $f$ is defined as $\ell_f(\bar x; x) := f(x) + \nabla f(x)^T(\bar x - x)$, where $\ell_f(x; x) = f(x)$ and $\nabla \ell_f(\bar x; x) = \nabla f(x)$. We have previously seen
$$|f(\bar x) - \ell_f(\bar x; x)| \le \frac{L}{2}\|\bar x - x\|^2, \quad \forall x, \bar x \in \Omega.$$

Definition 2.7. Given a set $\Omega \subseteq \mathbb{R}^n$ which is closed and convex, a function $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient, a point $x_0 \in \Omega$, and $\theta \in (0, 2)$, the projected gradient method is
$$x_k = \mathrm{argmin}_{x \in \Omega}\left\{ \ell_f(x; x_{k-1}) + \frac{L}{2\theta}\|x - x_{k-1}\|^2 \right\}, \qquad k \leftarrow k + 1. \quad (1)$$

Lemma 2.5. For all $k \ge 1$, we have
$$f(x_k) \le f(x_{k-1}) - \frac{L}{\theta}\left(1 - \frac{\theta}{2}\right)\|x_k - x_{k-1}\|^2.$$

Proof. By (1), since the objective of (1) is $(L/\theta)$-strongly convex with minimizer $x_k$,
$$\ell_f(x; x_{k-1}) + \frac{L}{2\theta}\|x - x_{k-1}\|^2 \ge \ell_f(x_k; x_{k-1}) + \frac{L}{2\theta}\|x_k - x_{k-1}\|^2 + \frac{L}{2\theta}\|x - x_k\|^2. \quad (2)$$
Taking $x = x_{k-1}$,
$$f(x_{k-1}) \ge \ell_f(x_k; x_{k-1}) + \frac{L}{\theta}\|x_k - x_{k-1}\|^2 = \ell_f(x_k; x_{k-1}) + \frac{L}{2}\|x_k - x_{k-1}\|^2 + \left(\frac{L}{\theta} - \frac{L}{2}\right)\|x_k - x_{k-1}\|^2 \ge f(x_k) + \frac{L}{\theta}\left(1 - \frac{\theta}{2}\right)\|x_k - x_{k-1}\|^2.$$
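Before the convergence bounds, a minimal sketch of the update in Definition 2.7: the subproblem in (1) is solved exactly by a gradient step of length $\theta/L$ followed by projection onto $\Omega$. The box constraint below is an illustrative choice of $\Omega$, not from the lecture.

```python
import numpy as np

def projected_gradient(grad, proj, x0, L, theta=1.0, iters=200):
    """Projected gradient (Definition 2.7): the prox subproblem
    argmin_{x in Omega} l_f(x; x_{k-1}) + (L/(2*theta))*||x - x_{k-1}||^2
    reduces to projecting the gradient step x_{k-1} - (theta/L)*grad(x_{k-1})."""
    x = x0.copy()
    for _ in range(iters):
        x = proj(x - (theta / L) * grad(x))
    return x

# Illustrative problem: minimize 0.5*||x - c||^2 over the box [0, 1]^2.
c = np.array([1.5, -0.3])
x = projected_gradient(lambda x: x - c, lambda y: np.clip(y, 0.0, 1.0), np.zeros(2), L=1.0)
print(x)  # approximately [1.0, 0.0], the projection of c onto the box
```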

Lemma 2.6. Given a set $\Omega \subseteq \mathbb{R}^n$ which is closed and convex, a convex function $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient, and the set of optimal solutions $\Omega^*$ for the optimization problem $\min f(x)$ s.t. $x \in \Omega$: for every $k \ge 1$ and $x^* \in \Omega^*$ we have
$$\frac{L}{2}\left(\|x^* - x_{k-1}\|^2 - \|x^* - x_k\|^2\right) \ge f(x_k) - f^*.$$

Proof. By (2), with $\theta = 1$ and $x = x^*$, we have
$$\underbrace{\ell_f(x^*; x_{k-1})}_{\le f(x^*)} + \frac{L}{2}\|x^* - x_{k-1}\|^2 \ge \underbrace{\ell_f(x_k; x_{k-1}) + \frac{L}{2}\|x_k - x_{k-1}\|^2}_{\ge f(x_k)} + \frac{L}{2}\|x^* - x_k\|^2$$
and the result follows after an algebraic re-arrangement.

Lemma 2.7. Under the previous lemma's assumptions, for all $k \ge 1$ and $x^* \in \Omega^*$, we have
$$\frac{L}{2}\left(\|x^* - x_0\|^2 - \|x^* - x_k\|^2\right) \ge \sum_{i=1}^k [f(x_i) - f^*] \ge k[f(x_k) - f^*].$$

Proof. (Easy exercise.)

Lemma 2.8. Under the previous lemma's assumptions, for all $k \ge 1$ and $x^* \in \Omega^*$, we have
$$\|x_k - x^*\| \le \|x_0 - x^*\|, \qquad f(x_k) - f^* \le \frac{L}{2k}\|x_0 - x^*\|^2.$$
Note that if $x^* = \Pi_{\Omega^*}(x_0)$ then $d_0 := \|x_0 - \Pi_{\Omega^*}(x_0)\|$ can be thought of as the distance of $x_0$ to $\Omega^*$, and
$$f(x_k) - f^* \le \frac{L d_0^2}{2k}.$$

Proof. (Follows from the previous lemma.)

Lemma 2.9. Define $r_k = \frac{L}{\theta}(x_{k-1} - x_k) + \nabla f(x_k) - \nabla f(x_{k-1})$. Then $r_k \in \nabla f(x_k) + N_\Omega(x_k)$, where if $r_k = 0$ then $x_k$ satisfies the optimality condition of $\min f(x)$ s.t. $x \in \Omega$.

Proof. Left as an exercise.

Definition 2.8. $\{a_k\}_{k=1}^\infty \subseteq \mathbb{R}$ converges geometrically if there exist $\gamma \ge 0$ and $\tau \in (0, 1)$ such that $|a_k| \le \gamma\tau^k$, $\forall k \ge 1$.

Note 1. Then $\lim_{k\to\infty}(a_k/[1/k^p]) = 0$ for $p > 0$, but the rate at which $a_k$ diminishes may be REALLY slow relative to $1/k^p$.

Lemma 2.10. Given a set $\Omega \subseteq \mathbb{R}^n$ which is closed and convex, a $\beta$-strongly convex function $f \in C^1(\Omega)$ which has $L$-Lipschitz continuous gradient, and the set of optimal solutions $\Omega^*$ for the optimization problem $\min f(x)$ s.t. $x \in \Omega$: for every $k \ge 1$ and $x^* \in \Omega^*$ we have
$$\frac{L}{2}\left(1 - \frac{\beta}{L}\right)^k d_0^2 \ge f(x_k) - f^*.$$

Proof. By (2), with $\theta = 1$ and $x = x^*$, we have
$$\underbrace{\ell_f(x^*; x_{k-1})}_{\le f(x^*) - \frac{\beta}{2}\|x^* - x_{k-1}\|^2} + \frac{L}{2}\|x^* - x_{k-1}\|^2 \ge \underbrace{\ell_f(x_k; x_{k-1}) + \frac{L}{2}\|x_k - x_{k-1}\|^2}_{\ge f(x_k)} + \frac{L}{2}\|x^* - x_k\|^2$$
and the result follows after an algebraic re-arrangement and iterating over $k$.

Exercise 2.1. Recall $r_k \in \nabla f(x_k) + N_\Omega(x_k)$ and $\|r_k\| \le L\left(\frac{1}{\theta} + 1\right)\|x_k - x_{k-1}\|$. For $\theta = 1$, we have $\|r_k\| \le 2L\|x_k - x_{k-1}\|$. Show that
$$\min_{i=1,\ldots,k}\|r_i\| = O\left(\frac{1}{\sqrt{k}}\right)\ \text{if}\ f\ \text{is convex}, \qquad \|r_k\| = O\left(\left(1 - \frac{\beta}{L}\right)^{k/2}\right)\ \text{if}\ f\ \text{is}\ \beta\text{-strongly convex}.$$

Remark 2.2. For a function $f(x) = \frac{1}{2}(x - x^*)^T Q(x - x^*) + \gamma$, we have $L = \lambda_{\max}(Q)$, $\beta = \lambda_{\min}(Q)$ and $\mathrm{cond}(Q) = \lambda_{\max}(Q)/\lambda_{\min}(Q)$, so the rate for $\|r_k\|$ is related to the inverse condition number of $Q$.

2.3 Gradient-type Methods

Problem 2.1. For standard minimization algorithms of the form $x_{k+1} = x_k + \alpha_k d_k$, where $\alpha_k$, $d_k$ are respective step sizes and descent directions, what conditions on $\{\alpha_k\}$, $\{d_k\}$ should we set to ensure convergence?

Remark 2.3. Assuming the gradient is still $L$-Lipschitz, we know
$$f(x') - f(x) - \nabla f(x)^T (x' - x) \le \frac{L}{2}\|x' - x\|^2 \implies f(x_k + \alpha d_k) \le f(x_k) + \alpha\nabla f(x_k)^T d_k + \frac{L\alpha^2}{2}\|d_k\|^2.$$
Take
$$\alpha_k = \mathrm{argmin}_{\alpha \in \mathbb{R}}\left\{ \alpha\nabla f(x_k)^T d_k + \frac{L\alpha^2}{2}\|d_k\|^2 \right\},$$
where at optimality we need
$$\nabla f(x_k)^T d_k + \alpha_k L\|d_k\|^2 = 0 \implies \alpha_k = -\frac{\nabla f(x_k)^T d_k}{L\|d_k\|^2} > 0.$$

Substituting this into the Lipschitz condition yields
$$f(x_k) - f(x_{k+1}) \ge \frac{(\nabla f(x_k)^T d_k)^2}{2L\|d_k\|^2} > 0.$$

Remark 2.4. Let $\epsilon_k = -\frac{\nabla f(x_k)^T d_k}{\|\nabla f(x_k)\|\|d_k\|}$, i.e. $\epsilon_k = \cos\theta_k$ where $\theta_k$ is the angle between $d_k$ and $-\nabla f(x_k)$. Then
$$f(x_k) - f(x_{k+1}) \ge \frac{\epsilon_k^2\|\nabla f(x_k)\|^2}{2L}$$
$$\implies f(x_0) - f^* \ge f(x_0) - f(x_k) \ge \sum_{i=0}^{k-1}[f(x_i) - f(x_{i+1})] \ge \frac{1}{2L}\left(\min_{i \le k-1}\|\nabla f(x_i)\|^2\right)\sum_{i=0}^{k-1}\epsilon_i^2$$
$$\implies \min_{i \le k-1}\|\nabla f(x_i)\|^2 \le \frac{2L(f(x_0) - f^*)}{\sum_{i=0}^{k-1}\epsilon_i^2}.$$
So if $\sum_{i=0}^\infty \epsilon_i^2 = \infty$ (e.g. $\epsilon_i \ge \epsilon$ for all $i$), then $\lim_{k\to\infty}\min_{i \le k}\|\nabla f(x_i)\| = 0$, or $\liminf_{k\to\infty}\|\nabla f(x_k)\| = 0$. If $\epsilon_i \ge \epsilon$ for all $i$, then
$$\min_{i \le k-1}\|\nabla f(x_i)\|^2 \le \frac{2L(f(x_0) - f^*)}{k\epsilon^2}.$$

Exercise 2.2. If $\alpha_k = -\theta\frac{\nabla f(x_k)^T d_k}{L\|d_k\|^2}$ and $\theta \in (0, 2)$, show that
$$f(x_k) - f(x_{k+1}) \ge \left(\theta - \frac{\theta^2}{2}\right)\frac{(\nabla f(x_k)^T d_k)^2}{L\|d_k\|^2}.$$

Remark 2.5. If $d_k = -D_k\nabla f(x_k)$ and $D_k$ is symmetric positive definite, then $\mathrm{cond}(D_k) \le \frac{1}{\epsilon} \implies \epsilon_k \ge \epsilon > 0$. The proof makes use of the fact that
$$\lambda_{\min}(D)\|u\|^2 \le u^T D u \le \lambda_{\max}(D)\|u\|^2, \qquad \|Du\| \le \lambda_{\max}(D)\|u\|,$$
and with $g = \nabla f(x)$, we have
$$\epsilon_k = -\frac{g_k^T d_k}{\|g_k\|\|d_k\|} = \frac{g_k^T D_k g_k}{\|g_k\|\|d_k\|} \ge \frac{\lambda_{\min}(D_k)\|g_k\|^2}{\|g_k\|\,\lambda_{\max}(D_k)\|g_k\|} = \frac{1}{\mathrm{cond}(D_k)} \ge \epsilon.$$

2.4 Inexact Line Search

Remark 2.6. Assume now that $L$ is not known or does not exist, and define $\phi_k(\alpha) = f(x_k + \alpha d_k) - f(x_k)$. We wish to choose $\alpha$ such that
$$\phi_k(\alpha) \le \sigma\phi_k'(0)\,\alpha \quad (*)$$
where $\sigma \in (0, 1)$ is a fixed constant, and where we wish $\alpha$ to not be close to $\bar\alpha$, a root of $\phi_k$, nor to $0$. To not be close to $0$, there are many strategies:
(a) Goldstein rule: for some constant $\tau \in (\sigma, 1)$, we require $\alpha_k$ to satisfy $\phi_k(\alpha) \ge \tau\phi_k'(0)\,\alpha$.
(b) Wolfe-Powell (W-P) rule: for some constant $\tau \in (\sigma, 1)$, we require $\alpha_k$ to satisfy $\phi_k'(\alpha) \ge \tau\phi_k'(0)$.

(c) Strong Wolfe-Powell rule: for some constant $\tau \in (\sigma, 1)$, we require $\alpha_k$ to satisfy $|\phi_k'(\alpha)| \le \tau|\phi_k'(0)|$ with $\sigma < 1/2$.
(d) Armijo's rule: let $s > 0$ and $\beta \in (0, 1)$ be fixed constants. Choose $\alpha_k$ as the largest scalar from $\{\alpha = s, s\beta, s\beta^2, \ldots\}$ such that (*) is satisfied.

Proposition 2.3. With respect to Armijo's rule:
1) $\exists \delta > 0$ such that (*) is satisfied strictly for any $\alpha \in (0, \delta)$.
2) If $\{\phi_k(\alpha) : \alpha > 0\}$ is bounded below, there exists an open interval of $\alpha$'s that satisfy rules (a) to (c).

Proof. Left as an exercise.

Theorem 2.1. Suppose that:
1) $f \in C^1(\mathbb{R}^n)$ and there exists $L > 0$ such that for all $y, z \in \{x : f(x) \le f(x_0)\}$ we have $\|\nabla f(y) - \nabla f(z)\| \le L\|y - z\|$;
2) $\{f(x_k)\}$ is bounded from below;
3) $\{d_k\}$ is gradient-related if $\alpha_k$ is chosen by Armijo's rule, i.e., there exists $\delta > 0$ such that $\|d_k\| \ge \delta\|\nabla f(x_k)\|$, $\forall k \ge 0$.
Then
$$\sum_{k=0}^\infty \epsilon_k^2\|\nabla f(x_k)\|^2 < \infty,$$
and hence if $\sum_{i=0}^\infty \epsilon_i^2 = \infty$ then $\liminf_{k\to\infty}\|\nabla f(x_k)\| = 0$. Thus, every accumulation point of $\{x_k\}$ is a stationary point.

Rates of Convergence

Consider the problem
$$\min_{x \in \mathbb{R}^n}\left\{ f(x) = \tfrac{1}{2}x^T Q x + c^T x + \gamma \right\}$$
where $Q \succ 0$ is symmetric.

Steepest Descent

The algorithm for our problem is
$$x_{k+1} = x_k - \alpha_k g_k, \qquad g_k = \nabla f(x_k), \qquad \alpha_k = \mathrm{argmin}_{\alpha \in \mathbb{R}} f(x_k - \alpha g_k) = \frac{\|g_k\|^2}{g_k^T Q g_k}.$$

Proposition 2.4. For every $k \ge 0$, we have
$$\frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} \le \left(\frac{M - m}{M + m}\right)^2 = \left(\frac{r - 1}{r + 1}\right)^2$$
where $m = \lambda_{\min}(Q)$, $M = \lambda_{\max}(Q)$ and $r = M/m = \mathrm{cond}(Q) \ge 1$.

Proof. (See the related proof for the projected gradient.)
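Returning briefly to Armijo's rule from Section 2.4, here is a minimal backtracking sketch of rule (d); the parameter defaults and the test function are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def armijo_step(f, grad_fx, x, d, s=1.0, beta=0.5, sigma=0.1, max_halvings=50):
    """Armijo's rule (d): try alpha = s, s*beta, s*beta^2, ... and accept the first
    alpha with f(x + alpha*d) - f(x) <= sigma * alpha * grad_fx^T d, where d is a
    descent direction (grad_fx^T d < 0 guarantees acceptance for small alpha)."""
    fx, slope = f(x), sigma * float(grad_fx @ d)
    alpha = s
    for _ in range(max_halvings):
        if f(x + alpha * d) - fx <= alpha * slope:
            return alpha
        alpha *= beta
    return alpha

# Illustrative usage with the steepest descent direction d = -grad f(x):
f = lambda x: float(x @ x)              # f(x) = ||x||^2
x = np.array([2.0, -1.0]); g = 2 * x
print(armijo_step(f, g, x, -g))          # returns 0.5 for these constants
```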

Gradient-type Methods

The algorithm for our problem is
$$x_{k+1} = x_k - \alpha_k D_k g_k \ \text{where}\ D_k \succ 0, \qquad \alpha_k = \mathrm{argmin}_{\alpha \in \mathbb{R}} f(x_k - \alpha D_k g_k).$$

Proposition 2.5. For every $k \ge 0$, we have
$$\frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} \le \left(\frac{M_k - m_k}{M_k + m_k}\right)^2 = \left(\frac{r_k - 1}{r_k + 1}\right)^2$$
where $M_k = \lambda_{\max}(D_k^{1/2} Q D_k^{1/2})$, $m_k = \lambda_{\min}(D_k^{1/2} Q D_k^{1/2})$, and $r_k = \mathrm{cond}(D_k^{1/2} Q D_k^{1/2})$.

Proof. We first note that
$$0 = \frac{d}{d\alpha}f(x_k + \alpha d_k) = \nabla f(x_k + \alpha d_k)^T d_k = [\nabla f(x_k) + \alpha_k Q d_k]^T d_k = \nabla f(x_k)^T d_k + \alpha_k d_k^T Q d_k$$
implies that $\alpha_k = -\frac{\nabla f(x_k)^T d_k}{d_k^T Q d_k}$. Next, if we define $\tilde f(y) = f(Sy)$ where $S = D_k^{1/2}$, then $\nabla\tilde f(y) = S\nabla f(Sy)$ and $\nabla^2\tilde f(y) = S\nabla^2 f(Sy)S = SQS$. For every $k$ let $y_k = S^{-1}x_k$ and note by our iteration scheme we have $\nabla\tilde f(y_k) = S\nabla f(x_k) = Sg_k$ as well as
$$Sy_{k+1} = Sy_k - \alpha_k S^2 g_k \implies y_{k+1} = y_k - \alpha_k\nabla\tilde f(y_k)$$
and
$$\alpha_k = \mathrm{argmin}_{\alpha \in \mathbb{R}} f(x_k - \alpha D_k g_k) = \mathrm{argmin}_{\alpha \in \mathbb{R}} \tilde f(y_k - \alpha Sg_k) = \mathrm{argmin}_{\alpha \in \mathbb{R}} \tilde f(y_k - \alpha\nabla\tilde f(y_k)).$$
So the transformed iteration is exact-line-search steepest descent on $\tilde f$, and from the previous proposition,
$$\frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} = \frac{\tilde f(y_{k+1}) - \tilde f^*}{\tilde f(y_k) - \tilde f^*} \le \left(\frac{M_k - m_k}{M_k + m_k}\right)^2 = \left(\frac{r_k - 1}{r_k + 1}\right)^2$$
where $M_k = \lambda_{\max}(D_k^{1/2} Q D_k^{1/2})$, $m_k = \lambda_{\min}(D_k^{1/2} Q D_k^{1/2})$, and $r_k = \mathrm{cond}(D_k^{1/2} Q D_k^{1/2})$.

Remark 2.7. If $r_k \to 1$ then
$$\lim_{k\to\infty}\frac{f(x_{k+1}) - f^*}{f(x_k) - f^*} = 0.$$
For example, if $D_k \to Q^{-1}$, then the above holds.

2.5 Newton's Method

Consider a function $h : \mathbb{R}^n \to \mathbb{R}^n$ where $h \in C^1(\mathbb{R}^n)$. Newton's method finds a point $x \in \mathbb{R}^n$ where $h(x) = 0$. The idea, for a given $x_k$, uses
$$h(x) \approx h(x_k) + h'(x_k)(x - x_k) = 0 \implies x_{k+1} = x_k - h'(x_k)^{-1}h(x_k).$$
In the case of $h(x) = \nabla f(x) = 0$ where $h'(x) = \nabla^2 f(x)$, we have the iteration scheme
$$x_{k+1} = x_k - \nabla^2 f(x_k)^{-1}\nabla f(x_k).$$
In general optimization, we may use a second-order approximation to $f(x)$ and apply Newton's method to find where $\nabla f(x) = 0$.

Local Convergence of Newton's Method

Theorem 2.2. Assume $h \in C^2(\mathbb{R}^n)$ and let $x^* \in \mathbb{R}^n$ be such that $h(x^*) = 0$ and $h'(x^*)$ is non-singular. Then there exists $\eta > 0$ such that if $x_0 \in B(x^*; \eta)$ then $\{x_k\}$ obtained as $x_{k+1} = x_k - [h'(x_k)]^{-1}h(x_k)$ is well-defined and
$$\lim_{k\to\infty} x_k = x^*, \qquad \limsup_{k\to\infty}\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} < \infty.$$

Proof. Let $L := 2\|h'(x^*)^{-1}\|$ and choose $\eta > 0$ such that for all $x \in B(x^*; \eta)$ we have:
- $h'(x)^{-1}$ exists, $\|h'(x)^{-1}\| \le L$;
- $\eta L M < 1$ where $M = \sup_{x \in B(x^*;\eta)}\|h''(x)\|$.
It can be shown that $\|h'(x) - h'(y)\| \le M\|x - y\|$, $\forall x, y \in B(x^*; \eta)$. Then if $x_k \in B(x^*; \eta)$ we have
$$x_{k+1} - x^* = x_k - x^* - h'(x_k)^{-1}h(x_k) = h'(x_k)^{-1}\big[\underbrace{h(x^*)}_{=0} - h(x_k) - h'(x_k)(x^* - x_k)\big].$$
So
$$\|x_{k+1} - x^*\| \le \|h'(x_k)^{-1}\|\,\|h(x^*) - h(x_k) - h'(x_k)(x^* - x_k)\| \le L\left\|\int_0^1 [h'(x_k + t(x^* - x_k)) - h'(x_k)](x^* - x_k)\,dt\right\|$$
$$\le L\|x^* - x_k\|\int_0^1 \|h'(x_k + t(x^* - x_k)) - h'(x_k)\|\,dt \le L\|x^* - x_k\|\int_0^1 Mt\|x^* - x_k\|\,dt$$
$$= \frac{ML}{2}\|x^* - x_k\|^2 \le \frac{ML\eta}{2}\|x^* - x_k\| < \|x_k - x^*\|,$$
and hence $\lim_{k\to\infty}\|x_k - x^*\| = 0$.
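A minimal sketch of the Newton iteration $x_{k+1} = x_k - \nabla^2 f(x_k)^{-1}\nabla f(x_k)$ follows; the test function is an illustrative assumption chosen so the Hessian is non-singular at the minimizer, matching the hypotheses of Theorem 2.2.

```python
import numpy as np

def newton(grad, hess, x0, iters=20, tol=1e-10):
    """Newton's method for grad f(x) = 0: x_{k+1} = x_k - [hess f(x_k)]^{-1} grad f(x_k)."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # solve the linear system; never invert explicitly
    return x

# Illustrative problem: f(x) = x1^2 + exp(x2) - x2, minimizer (0, 0), Hessian diag(2, e^{x2}) > 0.
grad = lambda x: np.array([2 * x[0], np.exp(x[1]) - 1.0])
hess = lambda x: np.array([[2.0, 0.0], [0.0, np.exp(x[1])]])
print(newton(grad, hess, np.array([1.5, 1.0])))  # converges quadratically to (0, 0)
```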

2.6 Conjugate Gradient Method

Suppose we are dealing with the problem
$$\min_{x \in \mathbb{R}^n}\left\{ \tfrac{1}{2}x^T Q x - b^T x \right\}$$
where $Q \succ 0$ is symmetric. Let $\{d_0, \ldots, d_{n-1}\}$ be a basis for $\mathbb{R}^n$, $x_0 \in \mathbb{R}^n$, and denote $[d_0, \ldots, d_k]$ as the subspace spanned by $d_0, \ldots, d_k$. We use the notation $g_k = \nabla f(x_k)$.

Lemma 2.11. We have
$$x_{k+1} = \mathrm{argmin}\left\{ f(x) = \tfrac{1}{2}x^T Q x - b^T x : x \in x_0 + [d_0, \ldots, d_k] \right\}$$
if and only if
$$x_{k+1} = x_0 - D_k\left(D_k^T Q D_k\right)^{-1}D_k^T g_0$$
where $D_k = [d_0 \cdots d_k] \in \mathbb{R}^{n \times (k+1)}$ and $g_0 = \nabla f(x_0) = Qx_0 - b$. Also, $g_{k+1} \perp d_i$ for $i = 0, \ldots, k$.

Proof. We know that $x \in x_0 + [d_0, \ldots, d_k] \iff x = x_0 + D_k z$ for some $z \in \mathbb{R}^{k+1}$. So $x_{k+1} = x_0 + D_k z_{k+1}$ where $z_{k+1} = \mathrm{argmin}_z f(x_0 + D_k z) =: h(z)$. In particular, $z_{k+1}$ solves
$$0 = \nabla h(z) = D_k^T\nabla f(x_0 + D_k z) = D_k^T[Q(x_0 + D_k z) - b] = (D_k^T Q D_k)z + D_k^T g_0.$$
So $z_{k+1} = -(D_k^T Q D_k)^{-1}D_k^T g_0$, and the result follows after re-arranging terms and remarking that $0 = D_k^T\nabla f(x_0 + D_k z_{k+1}) = D_k^T g_{k+1}$.

Definition 2.9. A set of directions $\{d_0, \ldots, d_k\} \subseteq \mathbb{R}^n$ is Q-conjugate if $d_i^T Q d_j = 0$ for every $0 \le i < j \le k$. Equivalently, $D_k^T Q D_k$ is diagonal.

Proposition 2.6. Suppose that $Q \succ 0$ and $d_0, \ldots, d_k$ are nonzero Q-conjugate vectors. Then $d_0, \ldots, d_k$ are linearly independent.

Proof. Exercise.

Theorem 2.3. (Expanding Subspace Minimization) Assume that $x_{k+1} = \mathrm{argmin}\{f(x) : x \in x_0 + [d_0, \ldots, d_k]\}$ and that $d_0, \ldots, d_{k-1}$ are Q-conjugate. Then:
(a) $x_n = x^*$;
(b) $g_{k+1}^T d_i = 0$ for $i = 0, \ldots, k$;
(c) $x_{k+1} = x_k + \alpha_k d_k$ where $\alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k}$, or equivalently $\alpha_k = \mathrm{argmin}_\alpha f(x_k + \alpha d_k)$, or equivalently $x_{k+1} = \mathrm{argmin}\{f(x) : x \in x_k + [d_k]\}$.

Proof. (a) and (b) are obvious. For (c), note that $x_k \in x_0 + [d_0, \ldots, d_{k-1}] \subseteq x_0 + [d_0, \ldots, d_k]$, and so $x_k + [d_0, \ldots, d_k] = x_0 + [d_0, \ldots, d_k]$.

In the previous algorithms, we can hence replace $x_0$ with $x_k$. In particular, the first lemma can be replaced with the iteration scheme
$$x_{k+1} = x_k - D_k\left(D_k^T Q D_k\right)^{-1}D_k^T g_k.$$
Simplifying with the facts
$$D_k^T g_k = (g_k^T d_k)e_{k+1}, \qquad D_k^T Q D_k = \mathrm{diag}(d_0^T Q d_0, \ldots, d_k^T Q d_k), \qquad (D_k^T Q D_k)^{-1}D_k^T g_k = \frac{g_k^T d_k}{d_k^T Q d_k}e_{k+1},$$
where $e_{k+1}$ is the $(k+1)$-st standard basis vector in $\mathbb{R}^{k+1}$, this then reduces the iteration scheme further to
$$x_{k+1} = x_k - \left(\frac{g_k^T d_k}{d_k^T Q d_k}\right)D_k e_{k+1} = x_k - \left(\frac{g_k^T d_k}{d_k^T Q d_k}\right)d_k = x_k + \alpha_k d_k.$$

Algorithm 1. (Conjugate Gradient Method [sketch]) Given $x_0 \in \mathbb{R}^n$, let $d_0 = -g_0 = b - Qx_0$. For $k = 0, 1, 2, \ldots$ do:
$$x_{k+1} = x_k + \alpha_k d_k \quad \text{where}\ \alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k}.$$
If $g_{k+1} = 0$, stop; else
$$d_{k+1} = -g_{k+1} + \beta_k d_k \quad \text{where}\ \beta_k = \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k}.$$

Remark 2.8. Observe that
$$0 = d_{k+1}^T Q d_k = (-g_{k+1} + \beta_k d_k)^T Q d_k = -g_{k+1}^T Q d_k + \beta_k d_k^T Q d_k \implies \beta_k = \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k}.$$

Lemma 2.12. (Gram-Schmidt) Assume that $d_0, \ldots, d_{k-1}$ are Q-conjugate nonzero vectors and $p_k \notin [d_0, \ldots, d_{k-1}]$. Define
$$d_k = p_k - \sum_{i=0}^{k-1}\frac{p_k^T Q d_i}{d_i^T Q d_i}d_i = p_k + \sum_{i=0}^{k-1}\beta_{k-1,i}d_i \quad \text{where}\ \beta_{k-1,i} = -\frac{p_k^T Q d_i}{d_i^T Q d_i}.$$
Then $d_0, \ldots, d_k$ are Q-conjugate nonzero vectors and $[d_0, \ldots, d_k] = [d_0, \ldots, d_{k-1}, p_k]$.

Proof. Exercise.

Algorithm 2. (Alternate Conjugate Gradient) For $x_0 \in \mathbb{R}^n$, $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, $Q \succ 0$ symmetric, let $d_0 = -g_0 = b - Qx_0$. For $k = 0, 1, 2, \ldots$ do:
$$x_{k+1} = x_k + \alpha_k d_k \quad \text{where}\ \alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k}.$$
If $g_{k+1} = 0$, stop; else $d_{k+1} = -g_{k+1} + \sum_{i=0}^k \beta_{ki}d_i$ where $\beta_{ki} = \frac{g_{k+1}^T Q d_i}{d_i^T Q d_i}$. Here, we are generating the $d_k$ vectors on the fly, and by adapting Gram-Schmidt we have the added bonus that we are preserving Q-conjugacy.

Lemma 2.13. If $d_0, \ldots, d_k$ are Q-conjugate and $g_{k+1} \notin [d_0, \ldots, d_k]$ then $d_{k+1}$ as above satisfies:
(1) $d_{k+1}$ is Q-conjugate w.r.t. $\{d_0, \ldots, d_k\}$;
(2) $[d_0, \ldots, d_{k+1}] = [d_0, \ldots, d_k, g_{k+1}]$.

Theorem 2.4. Assume that $g_i \ne 0$, $i \in \{0, \ldots, k\}$. Then for all $i \in \{0, 1, \ldots, k\}$ we have:
(i) $d_0, \ldots, d_i$ are Q-conjugate;
(ii) $g_0, \ldots, g_i$ are orthogonal;

(iii) $[d_0, \ldots, d_i] = [g_0, \ldots, g_i]$;
(iv) $[d_0, \ldots, d_i] = [g_0, Qg_0, \ldots, Q^i g_0]$;
(v) $\alpha_i = \|g_i\|^2/(d_i^T Q d_i)$ and $-g_i^T d_i = \|g_i\|^2$.

Proof. By induction on $i$. For $i = 0$, it is obvious. Assume it is true for $i - 1$. Hence:
(a) $[d_0, \ldots, d_{i-1}] = [g_0, \ldots, g_{i-1}] = [g_0, Qg_0, \ldots, Q^{i-1}g_0] =: \mathcal{L}_{i-1}$;
(b) $g_0, \ldots, g_{i-1}$ are orthogonal;
(c) $d_0, \ldots, d_{i-1}$ are Q-conjugate.
By our previous lemma, $d_i$ is Q-conjugate w.r.t. $\{d_0, \ldots, d_{i-1}\}$ and so (i) follows. Also by the lemma, we know
$$[d_0, \ldots, d_i] = [d_0, \ldots, d_{i-1}, g_i] = [g_0, \ldots, g_{i-1}, g_i]$$
from (a), which shows (iii). Next, we have $d_i \in [d_0, \ldots, d_{i-1}, g_i] = [g_0, \ldots, Q^{i-1}g_0, g_i]$ from (a). Also $g_i = g_{i-1} + \alpha_{i-1}Qd_{i-1}$ with $g_{i-1} \in \mathcal{L}_{i-1}$ and $Qd_{i-1} \in Q\mathcal{L}_{i-1} \subseteq \mathcal{L}_i$, so $g_i \in \mathcal{L}_i$. This tells us then that $d_i \in \mathcal{L}_i$, so $[d_0, \ldots, d_i] \subseteq \mathcal{L}_i$. Since $d_0, \ldots, d_i$ are linearly independent, then $[d_0, \ldots, d_i] = \mathcal{L}_i$ and (iv) follows. Now we have $g_i \perp [d_0, \ldots, d_{i-1}]$ since the method is a Q-conjugate direction method. Since $[d_0, \ldots, d_{i-1}] = [g_0, \ldots, g_{i-1}]$, then $g_i \perp [g_0, \ldots, g_{i-1}]$ and (ii) follows. For (v), note that $d_i = -g_i + u$ with $u \in \mathcal{L}_{i-1}$, and hence $-g_i^T d_i = \|g_i\|^2 - \underbrace{u^T g_i}_{=0}$, and the definition of $\alpha_i$ follows.

Proposition 2.7. Assume that $g_{k+1} \ne 0$. Then
$$\beta_{ki} = \begin{cases} \|g_{k+1}\|^2/\|g_k\|^2 & i = k, \\ 0 & 0 \le i < k. \end{cases}$$

Proof. By definition $\beta_{ki} = \frac{g_{k+1}^T Q d_i}{d_i^T Q d_i}$. Next,
$$Qd_i = Q\left(\frac{x_{i+1} - x_i}{\alpha_i}\right) = \frac{g_{i+1} - g_i}{\alpha_i},$$
so
$$g_{k+1}^T Q d_i = g_{k+1}^T\left(\frac{g_{i+1} - g_i}{\alpha_i}\right) = \begin{cases} \|g_{k+1}\|^2/\alpha_k & i = k, \\ 0 & 0 \le i < k, \end{cases} \qquad d_i^T Q d_i = d_i^T\left(\frac{g_{i+1} - g_i}{\alpha_i}\right) = -\frac{d_i^T g_i}{\alpha_i} = \frac{\|g_i\|^2}{\alpha_i},$$
and the result follows.

Convergence Rate of the Conjugate Gradient Method

Note that
$$x \in x_0 + [d_0, \ldots, d_{k-1}] \iff x \in x_0 + [g_0, \ldots, Q^{k-1}g_0] \iff x = x_0 + \gamma_1 g_0 + \cdots + \gamma_k Q^{k-1}g_0\ \text{for some}\ \gamma \in \mathbb{R}^k.$$
Now, $g_0 = Q(x_0 - x^*)$ and hence
$$x - x^* = x_0 - x^* + \gamma_1 Q(x_0 - x^*) + \cdots + \gamma_k Q^k(x_0 - x^*) = (I + \gamma_1 Q + \cdots + \gamma_k Q^k)(x_0 - x^*) = P_k(Q)(x_0 - x^*)$$
where $P_k \in \bar{\mathcal{P}}_k$ and $\bar{\mathcal{P}}_k$ is the set of polynomials of degree at most $k$ such that $P_k(0) = 1$. Now we have $f(x) = f(x^*) + \tfrac{1}{2}(x - x^*)^T Q(x - x^*)$, so
$$2(f(x) - f(x^*)) = \|Q^{1/2}(x - x^*)\|^2$$

and the original QP is equivalent to
$$2(f(x_k) - f(x^*)) = \min\left\{\|Q^{1/2}(x - x^*)\|^2 : x \in x_0 + [d_0, \ldots, d_{k-1}]\right\}$$
$$= \min\left\{\|Q^{1/2}P_k(Q)(x_0 - x^*)\|^2 : P_k \in \bar{\mathcal{P}}_k\right\} = \min\left\{\|P_k(Q)Q^{1/2}(x_0 - x^*)\|^2 : P_k \in \bar{\mathcal{P}}_k\right\}$$
$$\le \left(\min\left\{\|P_k(Q)\|^2 : P_k \in \bar{\mathcal{P}}_k\right\}\right)\|Q^{1/2}(x_0 - x^*)\|^2.$$

Proposition 2.8. For every $k \ge 0$, we have
$$\frac{f(x_k) - f^*}{f(x_0) - f^*} \le \min\left\{\|P_k(Q)\|^2 : P_k \in \bar{\mathcal{P}}_k\right\}, \qquad \text{and} \qquad \|P_k(Q)\| = \max_{\lambda \in \sigma(Q)}|P_k(\lambda)|,$$
where $\sigma(Q)$ is the spectrum of $Q$, i.e. the set of eigenvalues of $Q$.

Corollary 2.1. For every $k \ge 0$ and $P_k \in \bar{\mathcal{P}}_k$, we have
$$\frac{f(x_k) - f^*}{f(x_0) - f^*} \le \max_{\lambda \in \sigma(Q)}|P_k(\lambda)|^2.$$

Corollary 2.2. Assume that $Q$ has $m < n$ distinct eigenvalues. Then $x_m = x^*$.

Proof. Let $\lambda_1, \ldots, \lambda_m$ be the distinct eigenvalues of $Q$. Let $P_m \in \bar{\mathcal{P}}_m$ be defined as
$$P_m(\lambda) = \frac{\prod_{i=1}^m(\lambda_i - \lambda)}{\prod_{i=1}^m\lambda_i},$$
and since $P_m(\lambda) = 0$ for every $\lambda \in \sigma(Q)$, the result follows from the previous proposition.

Rate of Convergence of CG Method

Corollary 2.3. Assume that $Q$ has (1) $n - m$ eigenvalues in $[a, b]$, $m > 0$, and (2) $m$ eigenvalues which are greater than $b$. Then
$$\frac{f(x_{m+1}) - f^*}{f(x_0) - f^*} \le \left(\frac{b - a}{a + b}\right)^2.$$
In particular, for $m = 0$ and $a = \lambda_{\min}$, $b = \lambda_{\max}$, we have
$$\frac{f(x_1) - f^*}{f(x_0) - f^*} \le \left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^2.$$

Proof. Let $\lambda_1, \ldots, \lambda_m$ denote the eigenvalues greater than $b$ and define $\lambda_{m+1} = \frac{b + a}{2}$. Next, define
$$P_{m+1}(\lambda) = \frac{\prod_{i=1}^{m+1}(\lambda_i - \lambda)}{\prod_{i=1}^{m+1}\lambda_i}$$

where clearly $P_{m+1} \in \bar{\mathcal{P}}_{m+1}$. By a previous proposition,
$$\frac{f(x_{m+1}) - f^*}{f(x_0) - f^*} \le \max_{\lambda \in [a,b]}|P_{m+1}(\lambda)|^2 \le \max_{\lambda \in [a,b]}\left|1 - \frac{2\lambda}{a + b}\right|^2 = \left(\frac{b - a}{a + b}\right)^2.$$

Corollary 2.4. For all $k \ge 0$, we have
$$\frac{f(x_k) - f^*}{f(x_0) - f^*} \le 4\left(\frac{\sqrt r - 1}{\sqrt r + 1}\right)^{2k}$$
where $r = M/m$ is the condition number of $Q$.

Proof. (Sketch) Use the Chebyshev polynomials
$$T_k(x) = \frac{1}{2}\left[\left(x + \sqrt{x^2 - 1}\right)^k + \left(x + \sqrt{x^2 - 1}\right)^{-k}\right] = \cos(k\arccos x)\ \text{for}\ x \in [-1, 1], \qquad |T_k(x)| \le 1,\ x \in [-1, 1],$$
and define
$$P_k(\lambda) = \frac{T_k\left(\frac{M + m - 2\lambda}{M - m}\right)}{T_k\left(\frac{M + m}{M - m}\right)} \in \bar{\mathcal{P}}_k.$$
Use a similar procedure as before to obtain the result.

2.7 General Conjugate Gradient Method

Definition 2.10. Consider the problem (*) $\min\{f(x) : x \in \mathbb{R}^n\}$ where $f \in C^1(\mathbb{R}^n)$. The CG framework, given $x_0 \in \mathbb{R}^n$, is: for $k = 0, 1, \ldots$ do
$$x_{k+1} = x_k + \alpha_k d_k, \qquad d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k,$$
where $\alpha_k > 0$ is the step size (e.g. use an exact or inexact line search method). Recall for the convex quadratic,
$$\beta_k = \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k} = \underbrace{\frac{\|g_{k+1}\|^2}{\|g_k\|^2}}_{(1)} = \underbrace{\frac{g_{k+1}^T(g_{k+1} - g_k)}{\|g_k\|^2}}_{(2)}.$$
Using (1) in the general case leads to the Fletcher-Reeves (FR) method while (2) leads to the Polak-Ribière (PR) method. Note that (with an exact line search, so $g_{k+1}^T d_k = 0$) the update implies $g_{k+1}^T d_{k+1} = -\|g_{k+1}\|^2 < 0$.

Theorem 2.5. (PR) Assume that $f$ is such that for $0 < m \le M$, $m\|u\|^2 \le u^T\nabla^2 f(x)u \le M\|u\|^2$ for all $x, u \in \mathbb{R}^n$. Then the PR-CG method with exact line search converges to the unique global minimum of (*).

Theorem 2.6. Assume that $f \in C^2(\mathbb{R}^n)$ and $\{x : f(x) \le f(x_0)\}$ is bounded. Then there exists an accumulation point $\bar x$ of $\{x_k\}$ such that $\nabla f(\bar x) = 0$. If $f$ is convex then $\{x_k\} \to x^*$. The Strong Wolfe-Powell inexact line search is used in this scheme, where $0 < \sigma < 1/2$, $\sigma < \tau < 1$ and
$$f(x_k + \alpha_k d_k) \le f(x_k) + \sigma\alpha_k\nabla f(x_k)^T d_k, \qquad |\nabla f(x_k + \alpha_k d_k)^T d_k| \le -\tau\nabla f(x_k)^T d_k.$$
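Here is a minimal runnable sketch of Algorithm 1 (the quadratic case); swapping the $\beta_k$ line for formula (1) or (2) with general gradients gives the FR/PR variants described above. The test matrix is illustrative data, not from the lecture.

```python
import numpy as np

def conjugate_gradient(Q, b, x0=None, tol=1e-10):
    """Algorithm 1 (CG) for min 0.5 x^T Q x - b^T x with Q symmetric positive
    definite, i.e. solving Qx = b; finite termination in at most n steps
    in exact arithmetic (Theorem 2.4 / Corollary 2.2)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    g = Q @ x - b          # g_k = gradient at x_k
    d = -g                 # d_0 = -g_0
    for _ in range(len(b)):
        if np.linalg.norm(g) < tol:
            break
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)    # alpha_k = -g_k^T d_k / d_k^T Q d_k
        x = x + alpha * d
        g = Q @ x - b
        beta = (g @ Qd) / (d @ Qd)     # beta_k = g_{k+1}^T Q d_k / d_k^T Q d_k
        d = -g + beta * d
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b), np.linalg.solve(Q, b))  # both give Q^{-1} b
```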

2.8 Nesterov's Method

Definition 2.11. Suppose that $f \in C^1(\mathbb{R}^n)$ is convex and $\nabla f(x)$ is $L$-Lipschitz, so that
$$\ell_f(\bar x; x) \le f(\bar x) \le \ell_f(\bar x; x) + \frac{L}{2}\|\bar x - x\|^2, \qquad \ell_f(\bar x; x) = f(x) + \nabla f(x)^T(\bar x - x).$$
For the problem $\min\{f(x) : x \in X\}$, let $X$ be closed and convex and let $X^*$ be the set of optimal solutions. Nesterov's method is as follows:
(0) Let $x_0 \in \mathbb{R}^n$ be given and set $y_0 = x_0$, $k = 0$, $A_0 = 0$.
(1) Compute
$$a_k = \frac{1 + \sqrt{1 + 4LA_k}}{2L}, \qquad A_{k+1} = A_k + a_k, \qquad \tilde x_k = \frac{A_k y_k + a_k x_k}{A_{k+1}},$$
$$y_{k+1} = \mathrm{argmin}_{x \in X}\left\{ \ell_f(x; \tilde x_k) + \frac{L}{2}\|x - \tilde x_k\|^2 \right\}, \qquad x_{k+1} = x_k + a_k L(y_{k+1} - \tilde x_k).$$
(2) Set $k \leftarrow k + 1$ and go to (1).

Proposition 2.9. There exists a sequence of affine functions $\{\gamma_k\}_{k \ge 0}$ such that $\gamma_k \le f$ and
$$(1)_k \quad A_k f(y_k) \le \min_x\left\{ A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2 \right\},$$
$$(2)_k \quad x_k = \mathrm{argmin}_{x \in \mathbb{R}^n}\left\{ A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2 \right\},$$
where $\Gamma_k(x) = \left(\sum_{i=0}^{k-1}a_i\gamma_i(x)\right)/A_k$ and $\gamma_i(x) = \ell_f(x; \tilde x_i)$.

[***Aside: It is important to know that if $f$ is $\mu$-strongly convex, then $f(x) \ge \min f + \frac{\mu}{2}\|x - x^*\|^2$. This will show up on the exam!]

Lemma 2.14. For every $k \ge 0$ we have
$$A_k = \sum_{i=0}^{k-1}a_i, \qquad A_{k+1}\Gamma_{k+1} = A_k\Gamma_k + a_k\gamma_k, \qquad \gamma_k \le f \implies A_k\Gamma_k \le A_k f.$$

Proof. Trivial.

Proof. [of previous proposition] We proceed by induction on $k$. The case $k = 0$ is obvious, so assume that $(1)_k$ and $(2)_k$ hold. In particular, using the previous lemma and the strong convexity of the minimand in $(2)_k$,
$$A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2 \ge A_k f(y_k) + \tfrac{1}{2}\|x - x_k\|^2 \quad (3)$$

and so for all $x \in X$ (*) we have, using the lemma again and letting $\bar x = \bar x(x) = \frac{A_k y_k + a_k x}{A_{k+1}}$, since $\bar x(x) - \tilde x_k = \frac{a_k}{A_{k+1}}(x - x_k)$:
$$A_{k+1}\Gamma_{k+1}(x) + \tfrac{1}{2}\|x - x_0\|^2 = A_k\Gamma_k(x) + a_k\gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2$$
$$\overset{(3)}{\ge} A_k f(y_k) + \tfrac{1}{2}\|x - x_k\|^2 + a_k\gamma_k(x) \ge A_k\gamma_k(y_k) + a_k\gamma_k(x) + \tfrac{1}{2}\|x - x_k\|^2$$
$$= A_{k+1}\gamma_k\left(\frac{A_k y_k + a_k x}{A_{k+1}}\right) + \tfrac{1}{2}\|x - x_k\|^2 \qquad (\gamma_k\ \text{is affine})$$
$$= A_{k+1}\gamma_k(\bar x) + \tfrac{1}{2}\left\|\frac{A_{k+1}}{a_k}(\bar x - \tilde x_k)\right\|^2 = A_{k+1}\left(\gamma_k(\bar x) + \frac{A_{k+1}}{2a_k^2}\|\bar x - \tilde x_k\|^2\right)$$
$$= A_{k+1}\left(\gamma_k(\bar x) + \frac{L}{2}\|\bar x - \tilde x_k\|^2\right) = A_{k+1}\left[\ell_f(\bar x; \tilde x_k) + \frac{L}{2}\|\bar x - \tilde x_k\|^2\right]$$
$$\ge A_{k+1}\left[\ell_f(y_{k+1}; \tilde x_k) + \frac{L}{2}\|y_{k+1} - \tilde x_k\|^2\right] \ge A_{k+1}f(y_{k+1}),$$
using $a_k^2 = A_{k+1}/L$. Hence $(1)_{k+1}$ follows. Next, for $(2)_{k+1}$, it is sufficient to show that
$$A_{k+1}\nabla\Gamma_{k+1} + x_{k+1} - x_0 = 0.$$
This is due to the construction of the algorithm:
$$A_{k+1}\nabla\Gamma_{k+1} = A_k\nabla\Gamma_k + a_k\nabla\gamma_k \overset{(2)_k}{=} x_0 - x_k + a_k\nabla\gamma_k = x_0 - x_{k+1},$$
since
$$y_{k+1} = \mathrm{argmin}_{x \in X}\left\{\gamma_k(x) + \frac{L}{2}\|x - \tilde x_k\|^2\right\} \implies \nabla\gamma_k + L(y_{k+1} - \tilde x_k) = 0 \implies \nabla\gamma_k = L(\tilde x_k - y_{k+1}).$$

Remark 2.9. For the constrained case, where we want the minimization in (*) to extend over $x \in \mathbb{R}^n$, take
$$\gamma_k(x) = \langle L(\tilde x_k - y_{k+1}), x - y_{k+1}\rangle + \ell_f(y_{k+1}; \tilde x_k),$$
which has the properties
$$\gamma_k(y_{k+1}) = \ell_f(y_{k+1}; \tilde x_k), \qquad \nabla\gamma_k = L(\tilde x_k - y_{k+1}),$$
$$\min_{x \in \mathbb{R}^n}\left\{\gamma_k(x) + \frac{L}{2}\|x - \tilde x_k\|^2\right\} = \min_{x \in X}\left\{\ell_f(x; \tilde x_k) + \frac{L}{2}\|x - \tilde x_k\|^2\right\}.$$
The proof can be constructed in the same manner as before.
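Here is a minimal sketch of Definition 2.11 for the unconstrained case $X = \mathbb{R}^n$, where $y_{k+1}$ reduces to a gradient step and $x_{k+1} = x_k - a_k\nabla f(\tilde x_k)$; the quadratic test problem is an illustrative assumption.

```python
import numpy as np

def nesterov(grad, x0, L, iters=100):
    """Nesterov's method (Definition 2.11), unconstrained case X = R^n.
    a_k solves a_k^2 = (A_k + a_k)/L, so A_k >= k^2/(4L) and
    f(y_k) - f* <= ||x0 - x*||^2 / (2 A_k) by Corollary 2.5."""
    lam = 1.0 / L
    x, y, A = x0.copy(), x0.copy(), 0.0
    for _ in range(iters):
        a = (lam + np.sqrt(lam**2 + 4 * lam * A)) / 2.0
        A_next = A + a
        xt = (A * y + a * x) / A_next    # xt_k = (A_k y_k + a_k x_k) / A_{k+1}
        g = grad(xt)
        y = xt - g / L                    # y_{k+1} = argmin of the prox subproblem
        x = x - a * g                     # x_{k+1} = x_k + a_k L (y_{k+1} - xt_k)
        A = A_next
    return y

# Illustrative problem: f(x) = 0.5 x^T Q x - b^T x with L = lambda_max(Q).
Q = np.array([[10.0, 0.0], [0.0, 1.0]]); b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(Q).max()
print(nesterov(lambda z: Q @ z - b, np.zeros(2), L), np.linalg.solve(Q, b))
```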

Corollary 2.5. For every $k \ge 0$ and $x^* \in X^*$ we have
$$f(y_k) - f^* \le \frac{1}{2A_k}\|x^* - x_0\|^2 = \frac{d_0^2}{2A_k}.$$
One can then show that $a_k \ge \frac{\lambda}{2} + \sqrt{\lambda A_k}$ for $\lambda = 1/L$, and hence
$$A_{k+1} = A_k + a_k \ge \left(\sqrt{A_k} + \frac{\sqrt\lambda}{2}\right)^2 \implies \sqrt{A_{k+1}} \ge \sqrt{A_k} + \frac{\sqrt\lambda}{2} \implies A_k \ge \frac{k^2\lambda}{4} = \frac{k^2}{4L}.$$

Proof. Since $\Gamma_k(x) \le f(x)$, then
$$A_k f(y_k) \le A_k f(x) + \tfrac{1}{2}\|x - x_0\|^2,\ \forall x \in X \implies A_k f(y_k) \le A_k f(x^*) + \tfrac{1}{2}\|x^* - x_0\|^2.$$

Strongly Convex Case

Suppose we start with the following two assumptions:
(A1) $f$ is differentiable on $X$ and for $L > 0$ we have $\|\nabla f(x) - \nabla f(\bar x)\| \le L\|x - \bar x\|$ for all $x, \bar x \in X$;
(A2) $f$ is $\mu$-strongly convex.
We then have that (A1), (A2) imply that for $x, \bar x \in X$,
$$\ell_f(\bar x; x) + \frac{\mu}{2}\|\bar x - x\|^2 \le f(\bar x) \le \ell_f(\bar x; x) + \frac{L}{2}\|\bar x - x\|^2.$$

Algorithm 3. The Nesterov algorithm for $\mu$-strongly convex functions under (A1), (A2) is:
(0) Let $x_0 \in \mathbb{R}^n$ be given and set $y_0 = x_0$, $k = 0$, $A_0 = 0$, $\frac{1}{L} \le \lambda \le \frac{1}{L - \mu}$.
(1) Compute
$$\lambda_k = (1 + \mu A_k)\lambda, \qquad a_k = \frac{\lambda_k + \sqrt{\lambda_k^2 + 4\lambda_k A_k}}{2}, \qquad A_{k+1} = A_k + a_k,$$
$$\tilde x_k = \frac{A_k y_k + a_k x_k}{A_{k+1}}, \qquad \hat x_k = \Pi_X(\tilde x_k),$$
$$y_{k+1} = \mathrm{argmin}_{x \in X}\left\{\ell_f(x; \hat x_k) + \frac{1}{2\lambda}\|x - \hat x_k\|^2 + \frac{\mu}{2}\|x - \hat x_k\|^2\right\},$$
$$x_{k+1} = x_k - \frac{a_k}{1 + \mu A_{k+1}}\left[\frac{\hat x_k - y_{k+1}}{\lambda} + \mu(x_k - y_{k+1})\right].$$
(2) Set $k \leftarrow k + 1$ and go to (1). Note that $a_k^2 = (A_k + a_k)\lambda_k = A_{k+1}\lambda_k$.

Proposition 2.10. Let $q(y)$ be a $\mu$-strongly convex function such that $q \le f$ on $X$. For $\lambda > 0$ and $\hat x \in \mathbb{R}^n$, define
$$\hat y = \mathrm{argmin}_{y \in X}\left\{q(y) + \frac{1}{2\lambda}\|y - \hat x\|^2\right\}.$$

Then the function
$$\gamma(y) = q(\hat y) + \left\langle\frac{\hat x - \hat y}{\lambda},\, y - \hat y\right\rangle + \frac{\mu}{2}\|y - \hat y\|^2$$
satisfies:
(a) $\gamma(\hat y) = q(\hat y)$;
(b) $\hat y = \mathrm{argmin}_{y \in \mathbb{R}^n}\left\{\gamma(y) + \frac{1}{2\lambda}\|y - \hat x\|^2\right\}$;
(c) $\gamma$ is $\mu$-strongly convex on $\mathbb{R}^n$;
(d) $\gamma \le q$ on $X$, which implies $\gamma \le f$ on $X$.

Aside (for the exam). If $\phi^* = \min\{\phi(x)\}$ and $\phi$ is $\beta$-strongly convex, with $x^* = \mathrm{argmin}_x \phi(x)$, then $\phi^* + \frac{\beta}{2}\|x - x^*\|^2 \le \phi(x)$.

Aside (for the exam). If $f$ is $\mu$-strongly convex, then $\lambda f + \tfrac{1}{2}\|\cdot - x_0\|^2$ is $(\lambda\mu + 1)$-strongly convex.

Proposition 2.11. For every $k \ge 0$ define
$$\Gamma_k(y) = \frac{\sum_{i=0}^{k-1}a_i\gamma_i(y)}{A_k},\ y \in \mathbb{R}^n \quad (1) \qquad \left(\implies A_k\Gamma_k = A_{k-1}\Gamma_{k-1} + a_{k-1}\gamma_{k-1}\right)$$
where
$$\gamma_k(y) = q_k(y_{k+1}) + \left\langle\frac{\hat x_k - y_{k+1}}{\lambda},\, y - y_{k+1}\right\rangle + \frac{\mu}{2}\|y - y_{k+1}\|^2, \qquad q_k(y) = \ell_f(y; \hat x_k) + \frac{\mu}{2}\|y - \hat x_k\|^2.$$
Then we have:
(a) $\Gamma_k$ is $\mu$-strongly convex;
(b) $\gamma_k \le q_k \le f$ on $X$;
(c) $\Gamma_k \le f$ on $X$;
(d) $x_k = \mathrm{argmin}_{x \in \mathbb{R}^n}\left\{A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2\right\}$;
(e) $A_k f(y_k) \le \min\left\{A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2\right\}$.

Proof. (a) Obvious. (b) Use the fact that $q_k \le f$ on $X$; $\gamma_k \le q_k$ on $X$ follows from the previous proposition. (c) $\Gamma_k \le f$ on $X$ follows from (1) and the fact that $\gamma_i \le f$ on $X$. (d) and (e): by induction on $k$. For $k = 0$, it is obvious since $A_0 = 0$. First, assume that $(d)_k$ and $(e)_k$ hold. Then for all $x \in \mathbb{R}^n$ we have
$$A_k\Gamma_k(x) + \tfrac{1}{2}\|x - x_0\|^2 \ge A_k f(y_k) + \frac{A_k\mu + 1}{2}\|x - x_k\|^2.$$

So,
$$\min_{x \in \mathbb{R}^n}\left\{A_{k+1}\Gamma_{k+1}(x) + \tfrac12\|x - x_0\|^2\right\} = \min_{x \in \mathbb{R}^n}\left\{A_k\Gamma_k(x) + a_k\gamma_k(x) + \tfrac12\|x - x_0\|^2\right\}$$
$$\ge \min_{x \in \mathbb{R}^n}\left\{A_k f(y_k) + \frac{A_k\mu + 1}{2}\|x - x_k\|^2 + a_k\gamma_k(x)\right\} \ge \min_{x \in \mathbb{R}^n}\left\{A_k\gamma_k(y_k) + \frac{A_k\mu + 1}{2}\|x - x_k\|^2 + a_k\gamma_k(x)\right\}$$
$$\ge \min_{x \in \mathbb{R}^n}\left\{(A_k + a_k)\gamma_k\left(\frac{A_k y_k + a_k x}{A_k + a_k}\right) + \frac{A_k\mu + 1}{2}\|x - x_k\|^2\right\}$$
$$= \min_{\bar x \in \mathbb{R}^n}\left\{A_{k+1}\gamma_k(\bar x) + \frac{A_k\mu + 1}{2}\cdot\frac{A_{k+1}^2}{a_k^2}\|\bar x - \tilde x_k\|^2\right\} = A_{k+1}\min_{\bar x \in \mathbb{R}^n}\left\{\gamma_k(\bar x) + \frac{\lambda_k}{\lambda}\cdot\frac{1}{2\lambda_k}\|\bar x - \tilde x_k\|^2\right\}$$
$$= A_{k+1}\min_{\bar x \in \mathbb{R}^n}\left\{\gamma_k(\bar x) + \frac{1}{2\lambda}\|\bar x - \tilde x_k\|^2\right\},$$
using $a_k^2 = A_{k+1}\lambda_k$ and $\lambda_k = (1 + \mu A_k)\lambda$. Now
$$f(y_{k+1}) \le \ell_f(y_{k+1}; \hat x_k) + \frac{L}{2}\|y_{k+1} - \hat x_k\|^2 \le \ell_f(y_{k+1}; \hat x_k) + \frac{\mu}{2}\|y_{k+1} - \hat x_k\|^2 + \frac{L - \mu}{2}\|y_{k+1} - \hat x_k\|^2$$
$$\le q_k(y_{k+1}) + \frac{1}{2\lambda}\|y_{k+1} - \hat x_k\|^2 = \min_{y \in X}\left\{q_k(y) + \frac{1}{2\lambda}\|y - \hat x_k\|^2\right\} = \min_{y \in \mathbb{R}^n}\left\{\gamma_k(y) + \frac{1}{2\lambda}\|y - \hat x_k\|^2\right\},$$
and hence $\min_{x \in \mathbb{R}^n}\left\{A_{k+1}\Gamma_{k+1}(x) + \tfrac12\|x - x_0\|^2\right\} \ge A_{k+1}f(y_{k+1})$, which is $(e)_{k+1}$.

Let us prove that
$$(d)_{k+1} \quad A_{k+1}\nabla\Gamma_{k+1}(x_{k+1}) + x_{k+1} - x_0 = 0.$$
By $(d)_k$, $A_k\nabla\Gamma_k(x_k) + x_k - x_0 = 0$, and also
$$\nabla\Gamma_k(x) = \nabla\Gamma_k(\bar x) + \mu(x - \bar x), \quad \forall x, \bar x \in \mathbb{R}^n,\ k \ge 1. \quad (i)$$
So,
$$x_{k+1} - x_0 + A_{k+1}\nabla\Gamma_{k+1}(x_{k+1}) = x_{k+1} - x_0 + A_{k+1}[\nabla\Gamma_{k+1}(x_k) + \mu(x_{k+1} - x_k)]$$
$$= x_{k+1} - x_0 + A_k\nabla\Gamma_k(x_k) + a_k\nabla\gamma_k(x_k) + \mu A_{k+1}(x_{k+1} - x_k)$$
$$\overset{(i)}{=} (1 + \mu A_{k+1})(x_{k+1} - x_k) + a_k\left[\frac{\hat x_k - y_{k+1}}{\lambda} + \mu(x_k - y_{k+1})\right] = 0$$
by the update rule for $x_{k+1}$.

Corollary 2.6. For all $k \ge 1$ and $x^* \in X^*$ we have
$$f(y_k) - f^* \le \frac{1}{2A_k}\|x_0 - x^*\|^2.$$

Proof. (e) implies that
$$A_k f(y_k) \le A_k\Gamma_k(x^*) + \tfrac12\|x^* - x_0\|^2 \le A_k f(x^*) + \tfrac12\|x^* - x_0\|^2.$$

Proposition 2.12. For every $k \ge 1$ we have
$$A_k \ge \max\left\{\frac{k^2}{4L},\ \frac{1}{L}\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^{2(k-1)}\right\}.$$

Proof. Note that we have $a_k \ge \frac{\lambda_k}{2} + \sqrt{\lambda_k A_k}$ and hence
$$A_{k+1} = A_k + a_k \ge \left(\sqrt{A_k} + \frac{\sqrt{\lambda_k}}{2}\right)^2 = A_k + \sqrt{\lambda_k A_k} + \frac{\lambda_k}{4} \ge A_k + \sqrt{\mu\lambda}\,A_k + \frac{\mu\lambda}{4}A_k = A_k\left(1 + \frac{\sqrt{\mu\lambda}}{2}\right)^2 \ge A_k\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^2,$$
using $\lambda_k = (1 + \mu A_k)\lambda \ge \mu\lambda A_k$ and $\lambda \ge 1/L$, so
$$A_k \ge A_1\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^{2(k-1)} = \lambda\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^{2(k-1)} \ge \frac{1}{L}\left(1 + \frac12\sqrt{\frac{\mu}{L}}\right)^{2(k-1)}.$$
The first part of the maximum is from the original Nesterov method.

2.9 Quasi-Newton Methods

Quasi-Newton Method's General Scheme
(0) Let $x_0 \in \mathbb{R}^n$ and $H_0 \in \mathbb{R}^{n \times n}$ symmetric with $H_0 \succ 0$ be given.
(1) For $k = 0, 1, 2, \ldots$ set
$$d_k = -H_k g_k, \qquad x_{k+1} = x_k + \alpha_k d_k.$$

Update $H_k$ to obtain $H_{k+1} \succ 0$ and symmetric. Here, we want $H_k \approx [\nabla^2 f(x_k)]^{-1}$.

Motivation. Let
$$q_k = g_{k+1} - g_k, \qquad p_k = x_{k+1} - x_k.$$
Then $q_k = \nabla^2 f(x_k)p_k + o(\|p_k\|)$, and if $f$ is quadratic then $q_k = \nabla^2 f(x_k)p_k$.

Secant Equation. $p_k = H_{k+1}q_k$, which comes from our above approximation.

Rank-One Updates (SR1). $H_{k+1} = H_k + a_k z_k z_k^T$ where $a_k \in \mathbb{R}$ and $z_k \in \mathbb{R}^n$. The secant equation implies
$$p_k = H_{k+1}q_k = H_k q_k + a_k(z_k^T q_k)z_k,$$
and so $z_k$ is proportional to $p_k - H_k q_k$. If we choose $z_k = p_k - H_k q_k$ then
$$1 = a_k(z_k^T q_k) = a_k(p_k - H_k q_k)^T q_k \implies a_k = \frac{1}{(p_k - H_k q_k)^T q_k}$$
and we are left with the update
$$H_{k+1} = H_k + \frac{(p_k - H_k q_k)(p_k - H_k q_k)^T}{(p_k - H_k q_k)^T q_k}.$$

Rank-Two Updates. $H_{k+1} = H_k + a\,uu^T + b\,vv^T$ for $a, b \in \mathbb{R}$ and $u, v \in \mathbb{R}^n$. The secant equation implies that
$$p_k = H_{k+1}q_k = H_k q_k + a(u^T q_k)u + b(v^T q_k)v.$$
If we choose $u = p_k$ and $v = H_k q_k$ and enforce that
$$a(p_k^T q_k) = 1 \implies a = \frac{1}{p_k^T q_k}, \qquad b(q_k^T H_k q_k) = -1 \implies b = -\frac{1}{q_k^T H_k q_k},$$
then we have the Davidon-Fletcher-Powell (DFP) method with the update
$$H_{k+1}^{DFP} = H_k + \frac{p_k p_k^T}{p_k^T q_k} - \frac{H_k q_k q_k^T H_k}{q_k^T H_k q_k}.$$

Lemma 2.15. For $c, d \in \mathbb{R}^n$, we have $\|c\|\|d\| \ge c^T d$ and equality holds if and only if $c, d$ are colinear.

Theorem 2.7. If $p_k^T q_k > 0$ for all $k \ge 0$, then every $H_k$ generated in the above way is positive definite and symmetric.

Proof. We proceed by induction on $k$. For $k = 0$, it is obvious by the $H_0 \succ 0$ assumption. Assume that $H_k \succ 0$ for some $k \ge 0$. Let $x \ne 0$ be given. Then
$$x^T H_{k+1}x = x^T H_k x + \frac{(p_k^T x)^2}{p_k^T q_k} - \frac{(q_k^T H_k x)^2}{q_k^T H_k q_k}.$$

Let $c = H_k^{1/2}x$ and $d = H_k^{1/2}q_k$. Then
$$x^T H_{k+1}x = \|c\|^2 - \frac{(c^T d)^2}{\|d\|^2} + \frac{(p_k^T x)^2}{p_k^T q_k} = \frac{\|c\|^2\|d\|^2 - (c^T d)^2}{\|d\|^2} + \frac{(p_k^T x)^2}{p_k^T q_k} \ge 0$$
from the previous lemma.

Claim. $x^T H_{k+1}x > 0$.

Proof. Assume for contradiction that $x^T H_{k+1}x = 0$. Then $p_k^T x = 0$ and $c, d$ are colinear, that is, $x = \lambda q_k$ for $\lambda \ne 0$. Hence $0 = p_k^T x = \lambda q_k^T p_k \ne 0$, a contradiction. So $H_{k+1} \succ 0$ as required.

Question 1. How can we guarantee the condition $p_k^T q_k > 0$ for $\alpha_k > 0$? Note
$$q_k^T p_k = (g_{k+1} - g_k)^T(\alpha_k d_k) = \alpha_k(g_{k+1}^T d_k - g_k^T d_k).$$

Solution. It is enough to enforce $g_{k+1}^T d_k > g_k^T d_k$. An example of such an inexact line search is the Wolfe-Powell line search with $0 < \sigma < \tau < 1$. In particular, it has the conditions:
(1) $f(x_k + \alpha_k d_k) \le f(x_k) + \alpha_k\sigma g_k^T d_k$;
(2) $g_{k+1}^T d_k \ge \tau g_k^T d_k > g_k^T d_k$.

Sherman-Morrison Formula

Proposition 2.13. Assume that $A = B + USV^T$ where $S \in \mathbb{R}^{m \times m}$, $A, B \in \mathbb{R}^{n \times n}$ are non-singular and $U, V \in \mathbb{R}^{n \times m}$. If $P = S^{-1} + V^T B^{-1}U$ is non-singular then
$$A^{-1} = B^{-1} - B^{-1}UP^{-1}V^T B^{-1}.$$

Other Rank-Two Updates

We could try the following iteration scheme
$$x_{k+1} = x_k - \alpha_k B_k^{-1}g_k, \qquad B_k \approx \nabla^2 f(x_k),$$
where $B_{k+1}$ is obtained from $B_k$ by the following rank-two formula (enforcing $B_{k+1}p_k = q_k$):
$$B_{k+1}^{BFGS} = B_k + \frac{q_k q_k^T}{q_k^T p_k} - \frac{B_k p_k p_k^T B_k}{p_k^T B_k p_k}.$$
We call this the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update. Using inversion, we can use the Sherman-Morrison formula to get
$$H_{k+1}^{BFGS} = (B_{k+1}^{BFGS})^{-1} = H_k + \left(1 + \frac{q_k^T H_k q_k}{p_k^T q_k}\right)\frac{p_k p_k^T}{p_k^T q_k} - \frac{p_k q_k^T H_k + H_k q_k p_k^T}{q_k^T p_k},$$
where $A = B_{k+1}^{BFGS}$, $B = B_k$, $U = [q_k,\ B_k p_k]$, $V = U$ and $S = \mathrm{diag}\left(\frac{1}{p_k^T q_k},\ -\frac{1}{p_k^T B_k p_k}\right)$.
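A minimal sketch of the inverse BFGS update above, with a numerical check of the secant equation $p_k = H_{k+1}q_k$; the test vectors are illustrative data chosen so that $p^T q > 0$.

```python
import numpy as np

def bfgs_inverse_update(H, p, q):
    """BFGS update of the inverse-Hessian approximation H (requires p^T q > 0):
    H+ = H + (1 + q^T H q / p^T q) * p p^T / (p^T q) - (p q^T H + H q p^T) / (p^T q)."""
    pq = float(p @ q)
    Hq = H @ q
    return (H
            + (1.0 + float(q @ Hq) / pq) * np.outer(p, p) / pq
            - (np.outer(p, Hq) + np.outer(Hq, p)) / pq)   # p (Hq)^T = p q^T H for symmetric H

# Illustrative check of the secant equation p = H_{k+1} q:
p = np.array([1.0, 2.0, 3.0]); q = np.array([0.9, 2.1, 2.8])   # p^T q = 13.5 > 0
H_next = bfgs_inverse_update(np.eye(3), p, q)
assert np.allclose(H_next @ q, p)
```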

Broyden's Family of Algorithms

Let $\phi = \phi_k \in \mathbb{R}$. Then the method is defined as
$$H_{k+1}^\phi = (1 - \phi)H_{k+1}^{DFP} + \phi H_{k+1}^{BFGS} = H_{k+1}^{DFP} + \phi\,v_k v_k^T$$
where
$$v_k = (q_k^T H_k q_k)^{1/2}\left(\frac{p_k}{p_k^T q_k} - \frac{H_k q_k}{q_k^T H_k q_k}\right).$$

Theorem 2.8. If $H_k \succ 0$, $p_k^T q_k > 0$, $\phi \ge 0$ then $H_{k+1}^\phi \succ 0$.

Theorem 2.9. If $f(x) = \tfrac12 x^T Q x - b^T x + c$ with $Q \succ 0$ then for every $k \ge 0$ such that $g_k \ne 0$ we have:
(1) $H_{k+1}^\phi q_j = p_j$ for $j = 0, 1, \ldots, k$;
(2) $p_j^T Q p_i = 0$ for $0 \le i < j \le k$;
(3) $p_0, \ldots, p_k$ are nonzero.
Hence, the method terminates in $m \le n$ iterations. If $m = n$ then $H_n = Q^{-1}$.

Remark 2.10. Since $q_j = Qp_j$, then $H_{k+1}q_j = p_j \implies (H_{k+1}Q)p_j = p_j$ for $j = 0, 1, \ldots, k$, and so $H_{k+1}Q$ acts like an identity operator on a particular subspace. In particular, $(H_{k+1}Q)x = x$ for all $x \in [p_0, \ldots, p_k]$.

Theorem. If $H_0 = I$ then the iterates generated by Broyden's Quasi-Newton method, with the exact line search method, are identical to those generated by the conjugate gradient method.

Convergence Result for General f

Theorem 2.10. Let $f : \mathbb{R}^n \to \mathbb{R}$, $f \in C^2(\mathbb{R}^n)$, and $x_0 \in \mathbb{R}^n$ be such that:
(1) $S = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is bounded and convex;
(2) $\nabla^2 f(x) \succ 0$ for all $x \in S$.
Let $\{x_k\}$ be a sequence generated by the Broyden Quasi-Newton method $x_{k+1} = x_k - \alpha_k H_k^{\phi_k}g_k$, where $\phi_k \in [0, 1]$, $H_0 = I$, and $\alpha_k$ is chosen by the W-P rule with $\alpha_k = 1$ the first attempted step size. Then $\lim_{k\to\infty} x_k = x^*$ superlinearly, in the sense that
$$\lim_{k\to\infty}\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0,$$
where $x^*$ is the unique global minimum of $f$ over $S$.

Limited Memory Quasi-Newton Methods

The general formula for a Quasi-Newton method is
$$\phi(H, p, q) = H + \left(1 + \frac{q^T H q}{p^T q}\right)\frac{pp^T}{p^T q} - \frac{pq^T H + Hqp^T}{p^T q} = \left(I - \frac{pq^T}{p^T q}\right)H\left(I - \frac{qp^T}{p^T q}\right) + \frac{pp^T}{p^T q},$$
and in particular $H_k^{BFGS} = \phi(H_{k-1}, p_{k-1}, q_{k-1})$. The idea for the limited-memory variant is that we store the latest pairs $(p_i, q_i)$ for $i = k-1, \ldots, k-m$ and generate $H_k$ recursively through the steps:
1. $H = H_k^0$ (simple, say $H = I$).
2. For $i = k-m, \ldots, k-1$ set $H \leftarrow \phi(H, p_i, q_i)$.
3. $H_k = H$.
It turns out this scheme makes the calculation of $H_k g_k$ very easy and the intermediate $H$ matrices simple as well. The following is a full description of the algorithm.

Algorithm 4. (For computing $H_k g_k$)
- $u \leftarrow g_k$
- for $i = k-1, \ldots, k-m$: $\ \alpha_i \leftarrow \frac{p_i^T u}{p_i^T q_i}$; $\ u \leftarrow u - \alpha_i q_i$
- $r \leftarrow H_k^0 u$
- for $i = k-m, \ldots, k-1$: $\ \beta \leftarrow \frac{q_i^T r}{p_i^T q_i}$; $\ r \leftarrow r + (\alpha_i - \beta)p_i$
- $H_k g_k \leftarrow r$
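A minimal sketch of Algorithm 4 (the two-loop recursion), which computes $H_k g_k$ without ever forming $H_k$; the test data at the bottom is an illustrative assumption.

```python
import numpy as np

def lbfgs_direction(g, pairs, H0_diag=1.0):
    """Algorithm 4: compute H_k g from the m most recent (p_i, q_i) pairs,
    supplied oldest-to-newest; H_k^0 is taken as H0_diag * I."""
    u = g.copy()
    alphas = []
    for p, q in reversed(pairs):                   # first loop: i = k-1, ..., k-m
        a = float(p @ u) / float(p @ q)
        alphas.append(a)
        u = u - a * q
    r = H0_diag * u                                 # r = H_k^0 u
    for (p, q), a in zip(pairs, reversed(alphas)):  # second loop: i = k-m, ..., k-1
        beta = float(q @ r) / float(p @ q)
        r = r + (a - beta) * p
    return r                                        # r = H_k g

# With a single stored pair and H_k^0 = I, this equals one BFGS update applied to g:
p = np.array([1.0, 2.0, 3.0]); q = np.array([0.9, 2.1, 2.8]); g = np.array([1.0, 0.0, -1.0])
print(lbfgs_direction(g, [(p, q)]))
```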

3 Constrained Optimization

The standard constrained optimization problem in this section will be denoted by
$$(ECP)\quad \min f(x) \quad \text{s.t.}\ h_i(x) = 0,\ i = 1, 2, \ldots, m, \qquad x \in \mathbb{R}^n,\ f, h_i \in C^2(\mathbb{R}^n).$$

Definition 3.1. We say that $x \in \mathbb{R}^n$ is a regular point of (ECP) if $\nabla h_1(x), \ldots, \nabla h_m(x)$ are linearly independent (equivalently, $\nabla h(x) = [\nabla h_1(x) \cdots \nabla h_m(x)]$ has full column rank).

Remark 3.1. If $x$ is a regular point, the matrix
$$\nabla h(x)^T\nabla h(x) \in \mathbb{R}^{m \times m}$$
is nonsingular.

Theorem 3.1. (Lagrange Multiplier Theorem - first-order necessary optimality conditions) If $x^*$ is a regular local minimum of (ECP), then there exists a unique (!) $\lambda^* \in \mathbb{R}^m$ such that
$$\nabla f(x^*) + \sum_{i=1}^m \lambda_i^*\nabla h_i(x^*) = 0.$$
More compactly, we have $\nabla f(x^*) + \nabla h(x^*)\lambda^* = 0$.

Proof. (Construction) There exists $\epsilon > 0$ such that
$$f(x) \ge f(x^*), \quad \forall x \in \bar B(x^*; \epsilon) =: S\ \text{s.t.}\ h(x) = 0. \quad (0)$$
Let $\alpha > 0$ be given and, for every $k \in \mathbb{N}$, let
$$x_k \in \mathrm{argmin}_{x \in S}\ F_k(x) := f(x) + \frac{k}{2}\|h(x)\|^2 + \frac{\alpha}{2}\|x - x^*\|^2,$$
where existence is guaranteed by the Weierstrass theorem.

Claim 3.1. $\lim_{k\to\infty} x_k = x^*$.

Proof. (Of claim) For all $k$ we have
$$F_k(x_k) \le F_k(x^*) \iff f(x_k) + \frac{k}{2}\|h(x_k)\|^2 + \frac{\alpha}{2}\|x_k - x^*\|^2 \le f(x^*). \quad (1)$$
Since $f(x)$ is bounded on $S$, we have that $\{f(x_k)\}$ is bounded. As $k \to \infty$, we have
$$\lim_{k\to\infty}h(x_k) = 0. \quad (2)$$
Let $\bar x$ be an accumulation point of $\{x_k\}$. By (2), we have $h(\bar x) = 0$, and by (1) we have
$$f(\bar x) + \frac{\alpha}{2}\|\bar x - x^*\|^2 \le f(x^*). \quad (3)$$
Since $\bar x \in S$ and $h(\bar x) = 0$, by (0), we have
$$f(\bar x) \ge f(x^*), \quad (4)$$
and by (3), (4), $\|\bar x - x^*\| = 0$.

(Theorem proof cont.) For all $k$ sufficiently large, $x_k \in \mathrm{int}(S)$ and hence $\nabla F_k(x_k) = 0$ and $\nabla^2 F_k(x_k) \succeq 0$. Now,
$$0 = \nabla F_k(x_k) = \nabla f(x_k) + k\,\nabla h(x_k)h(x_k) + \alpha(x_k - x^*) = \nabla f(x_k) + \nabla h(x_k)\lambda_k + \alpha(x_k - x^*)$$
where $\lambda_k = k\,h(x_k)$.

Claim 3.2. $\{\lambda_k\} \to \lambda^*$ for some $\lambda^* \in \mathbb{R}^m$.

Proof. We have
$$\nabla h(x_k)^T\nabla h(x_k)\lambda_k = -\nabla h(x_k)^T[\nabla f(x_k) + \alpha(x_k - x^*)]$$
$$\implies \lambda_k = -\left[\nabla h(x_k)^T\nabla h(x_k)\right]^{-1}\nabla h(x_k)^T[\nabla f(x_k) + \alpha(x_k - x^*)]$$
$$\implies \lim_{k\to\infty}\lambda_k = -\left[\nabla h(x^*)^T\nabla h(x^*)\right]^{-1}\nabla h(x^*)^T\nabla f(x^*) =: \lambda^*.$$

(Theorem proof cont.) Taking limits with the above results gives $\nabla f(x^*) + \nabla h(x^*)\lambda^* = 0$.

Theorem 3.2. (Second-Order Necessary Conditions) If $x^*$ is a regular local minimum of (ECP), then there exists a unique $\lambda^* \in \mathbb{R}^m$ such that $\nabla f(x^*) + \nabla h(x^*)\lambda^* = 0$ and
$$d^T\left(\nabla^2 f(x^*) + \sum_{i=1}^m \lambda_i^*\nabla^2 h_i(x^*)\right)d \ge 0 \quad \text{for all}\ d \in V(x^*),$$
where $V(x^*) = \{d \in \mathbb{R}^n : \nabla h(x^*)^T d = 0\}$.

Proof. Define
$$F_k(x) := f(x) + \frac{k}{2}\|h(x)\|^2 + \frac{\alpha}{2}\|x - x^*\|^2$$
and note that for all $k$ sufficiently large,
$$0 \preceq \nabla^2 F_k(x_k) = \nabla^2 f(x_k) + \sum_i \lambda_{k,i}\nabla^2 h_i(x_k) + k\,\nabla h(x_k)\nabla h(x_k)^T + \alpha I.$$
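As a quick illustration of Theorem 3.1 (a small worked example added for concreteness, not from the lecture): minimize $f(x) = x_1 + x_2$ subject to $h(x) = x_1^2 + x_2^2 - 2 = 0$. Every feasible point is regular since $\nabla h(x) = (2x_1, 2x_2)^T \ne 0$ on the circle. The condition $\nabla f(x^*) + \lambda^*\nabla h(x^*) = 0$ reads
$$1 + 2\lambda^* x_1^* = 0, \qquad 1 + 2\lambda^* x_2^* = 0,$$
so $x_1^* = x_2^* = -1/(2\lambda^*)$; feasibility then gives the two stationary points $x^* = (-1, -1)$ with $\lambda^* = \tfrac12$ (the minimizer) and $(1, 1)$ with $\lambda^* = -\tfrac12$ (the maximizer).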


More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

The Conjugate Gradient Algorithm

The Conjugate Gradient Algorithm Optimization over a Subspace Conjugate Direction Methods Conjugate Gradient Algorithm Non-Quadratic Conjugate Gradient Algorithm Optimization over a Subspace Consider the problem min f (x) subject to x

More information

Inequality Constraints

Inequality Constraints Chapter 2 Inequality Constraints 2.1 Optimality Conditions Early in multivariate calculus we learn the significance of differentiability in finding minimizers. In this section we begin our study of the

More information

Examination paper for TMA4180 Optimization I

Examination paper for TMA4180 Optimization I Department of Mathematical Sciences Examination paper for TMA4180 Optimization I Academic contact during examination: Phone: Examination date: 26th May 2016 Examination time (from to): 09:00 13:00 Permitted

More information

Lecture 18: November Review on Primal-dual interior-poit methods

Lecture 18: November Review on Primal-dual interior-poit methods 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Lecturer: Javier Pena Lecture 18: November 2 Scribes: Scribes: Yizhu Lin, Pan Liu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0 Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Numerical Optimization of Partial Differential Equations

Numerical Optimization of Partial Differential Equations Numerical Optimization of Partial Differential Equations Part I: basic optimization concepts in R n Bartosz Protas Department of Mathematics & Statistics McMaster University, Hamilton, Ontario, Canada

More information

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23

Optimization: Nonlinear Optimization without Constraints. Nonlinear Optimization without Constraints 1 / 23 Optimization: Nonlinear Optimization without Constraints Nonlinear Optimization without Constraints 1 / 23 Nonlinear optimization without constraints Unconstrained minimization min x f(x) where f(x) is

More information

Notes on Numerical Optimization

Notes on Numerical Optimization Notes on Numerical Optimization University of Chicago, 2014 Viva Patel October 18, 2014 1 Contents Contents 2 List of Algorithms 4 I Fundamentals of Optimization 5 1 Overview of Numerical Optimization

More information

10 Numerical methods for constrained problems

10 Numerical methods for constrained problems 10 Numerical methods for constrained problems min s.t. f(x) h(x) = 0 (l), g(x) 0 (m), x X The algorithms can be roughly divided the following way: ˆ primal methods: find descent direction keeping inside

More information

1. Search Directions In this chapter we again focus on the unconstrained optimization problem. lim sup ν

1. Search Directions In this chapter we again focus on the unconstrained optimization problem. lim sup ν 1 Search Directions In this chapter we again focus on the unconstrained optimization problem P min f(x), x R n where f : R n R is assumed to be twice continuously differentiable, and consider the selection

More information

Gradient-Based Optimization

Gradient-Based Optimization Multidisciplinary Design Optimization 48 Chapter 3 Gradient-Based Optimization 3. Introduction In Chapter we described methods to minimize (or at least decrease) a function of one variable. While problems

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

NORMS ON SPACE OF MATRICES

NORMS ON SPACE OF MATRICES NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system

More information

Numerical Methods for Large-Scale Nonlinear Systems

Numerical Methods for Large-Scale Nonlinear Systems Numerical Methods for Large-Scale Nonlinear Systems Handouts by Ronald H.W. Hoppe following the monograph P. Deuflhard Newton Methods for Nonlinear Problems Springer, Berlin-Heidelberg-New York, 2004 Num.

More information

Lecture 10: September 26

Lecture 10: September 26 0-725: Optimization Fall 202 Lecture 0: September 26 Lecturer: Barnabas Poczos/Ryan Tibshirani Scribes: Yipei Wang, Zhiguang Huo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms: A Comparative Study

Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms: A Comparative Study International Journal of Mathematics And Its Applications Vol.2 No.4 (2014), pp.47-56. ISSN: 2347-1557(online) Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms:

More information

Static unconstrained optimization

Static unconstrained optimization Static unconstrained optimization 2 In unconstrained optimization an objective function is minimized without any additional restriction on the decision variables, i.e. min f(x) x X ad (2.) with X ad R

More information

University of Maryland at College Park. limited amount of computer memory, thereby allowing problems with a very large number

University of Maryland at College Park. limited amount of computer memory, thereby allowing problems with a very large number Limited-Memory Matrix Methods with Applications 1 Tamara Gibson Kolda 2 Applied Mathematics Program University of Maryland at College Park Abstract. The focus of this dissertation is on matrix decompositions

More information

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization E5295/5B5749 Convex optimization with engineering applications Lecture 8 Smooth convex unconstrained and equality-constrained minimization A. Forsgren, KTH 1 Lecture 8 Convex optimization 2006/2007 Unconstrained

More information

Lecture 1: Introduction. Outline. B9824 Foundations of Optimization. Fall Administrative matters. 2. Introduction. 3. Existence of optima

Lecture 1: Introduction. Outline. B9824 Foundations of Optimization. Fall Administrative matters. 2. Introduction. 3. Existence of optima B9824 Foundations of Optimization Lecture 1: Introduction Fall 2009 Copyright 2009 Ciamac Moallemi Outline 1. Administrative matters 2. Introduction 3. Existence of optima 4. Local theory of unconstrained

More information

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained

More information

Nonlinear equations. Norms for R n. Convergence orders for iterative methods

Nonlinear equations. Norms for R n. Convergence orders for iterative methods Nonlinear equations Norms for R n Assume that X is a vector space. A norm is a mapping X R with x such that for all x, y X, α R x = = x = αx = α x x + y x + y We define the following norms on the vector

More information

FIXED POINT ITERATIONS

FIXED POINT ITERATIONS FIXED POINT ITERATIONS MARKUS GRASMAIR 1. Fixed Point Iteration for Non-linear Equations Our goal is the solution of an equation (1) F (x) = 0, where F : R n R n is a continuous vector valued mapping in

More information

Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright.

Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright. Solutions and Notes to Selected Problems In: Numerical Optimzation by Jorge Nocedal and Stephen J. Wright. John L. Weatherwax July 7, 2010 wax@alum.mit.edu 1 Chapter 5 (Conjugate Gradient Methods) Notes

More information

A Distributed Newton Method for Network Utility Maximization, II: Convergence

A Distributed Newton Method for Network Utility Maximization, II: Convergence A Distributed Newton Method for Network Utility Maximization, II: Convergence Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract The existing distributed algorithms for Network Utility

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION. Darin Griffin Mohr. An Abstract

HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION. Darin Griffin Mohr. An Abstract HYBRID RUNGE-KUTTA AND QUASI-NEWTON METHODS FOR UNCONSTRAINED NONLINEAR OPTIMIZATION by Darin Griffin Mohr An Abstract Of a thesis submitted in partial fulfillment of the requirements for the Doctor of

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

Search Directions for Unconstrained Optimization

Search Directions for Unconstrained Optimization 8 CHAPTER 8 Search Directions for Unconstrained Optimization In this chapter we study the choice of search directions used in our basic updating scheme x +1 = x + t d. for solving P min f(x). x R n All

More information

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss

More information

Penalty and Barrier Methods General classical constrained minimization problem minimize f(x) subject to g(x) 0 h(x) =0 Penalty methods are motivated by the desire to use unconstrained optimization techniques

More information

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH

GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH GLOBAL CONVERGENCE OF CONJUGATE GRADIENT METHODS WITHOUT LINE SEARCH Jie Sun 1 Department of Decision Sciences National University of Singapore, Republic of Singapore Jiapu Zhang 2 Department of Mathematics

More information

Miscellaneous Nonlinear Programming Exercises

Miscellaneous Nonlinear Programming Exercises Miscellaneous Nonlinear Programming Exercises Henry Wolkowicz 2 08 21 University of Waterloo Department of Combinatorics & Optimization Waterloo, Ontario N2L 3G1, Canada Contents 1 Numerical Analysis Background

More information

Global Convergence of Perry-Shanno Memoryless Quasi-Newton-type Method. 1 Introduction

Global Convergence of Perry-Shanno Memoryless Quasi-Newton-type Method. 1 Introduction ISSN 1749-3889 (print), 1749-3897 (online) International Journal of Nonlinear Science Vol.11(2011) No.2,pp.153-158 Global Convergence of Perry-Shanno Memoryless Quasi-Newton-type Method Yigui Ou, Jun Zhang

More information

Optimization Methods. Lecture 19: Line Searches and Newton s Method

Optimization Methods. Lecture 19: Line Searches and Newton s Method 15.93 Optimization Methods Lecture 19: Line Searches and Newton s Method 1 Last Lecture Necessary Conditions for Optimality (identifies candidates) x local min f(x ) =, f(x ) PSD Slide 1 Sufficient Conditions

More information

Statistics 580 Optimization Methods

Statistics 580 Optimization Methods Statistics 580 Optimization Methods Introduction Let fx be a given real-valued function on R p. The general optimization problem is to find an x ɛ R p at which fx attain a maximum or a minimum. It is of

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Quasi Newton Methods Barnabás Póczos & Ryan Tibshirani Quasi Newton Methods 2 Outline Modified Newton Method Rank one correction of the inverse Rank two correction of the

More information

On nonlinear optimization since M.J.D. Powell

On nonlinear optimization since M.J.D. Powell On nonlinear optimization since 1959 1 M.J.D. Powell Abstract: This view of the development of algorithms for nonlinear optimization is based on the research that has been of particular interest to the

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Quasi-Newton Methods. Javier Peña Convex Optimization /36-725

Quasi-Newton Methods. Javier Peña Convex Optimization /36-725 Quasi-Newton Methods Javier Peña Convex Optimization 10-725/36-725 Last time: primal-dual interior-point methods Consider the problem min x subject to f(x) Ax = b h(x) 0 Assume f, h 1,..., h m are convex

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Optimization II: Unconstrained Multivariable

Optimization II: Unconstrained Multivariable Optimization II: Unconstrained Multivariable CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Justin Solomon CS 205A: Mathematical Methods Optimization II: Unconstrained Multivariable 1

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

EECS260 Optimization Lecture notes

EECS260 Optimization Lecture notes EECS260 Optimization Lecture notes Based on Numerical Optimization (Nocedal & Wright, Springer, 2nd ed., 2006) Miguel Á. Carreira-Perpiñán EECS, University of California, Merced May 2, 2010 1 Introduction

More information

MATH 4211/6211 Optimization Basics of Optimization Problems

MATH 4211/6211 Optimization Basics of Optimization Problems MATH 4211/6211 Optimization Basics of Optimization Problems Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0 A standard minimization

More information

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent Nonlinear Optimization Steepest Descent and Niclas Börlin Department of Computing Science Umeå University niclas.borlin@cs.umu.se A disadvantage with the Newton method is that the Hessian has to be derived

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Introduction to Nonlinear Stochastic Programming

Introduction to Nonlinear Stochastic Programming School of Mathematics T H E U N I V E R S I T Y O H F R G E D I N B U Introduction to Nonlinear Stochastic Programming Jacek Gondzio Email: J.Gondzio@ed.ac.uk URL: http://www.maths.ed.ac.uk/~gondzio SPS

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell DAMTP 2014/NA02 On fast trust region methods for quadratic models with linear constraints M.J.D. Powell Abstract: Quadratic models Q k (x), x R n, of the objective function F (x), x R n, are used by many

More information

Unconstrained optimization I Gradient-type methods

Unconstrained optimization I Gradient-type methods Unconstrained optimization I Gradient-type methods Antonio Frangioni Department of Computer Science University of Pisa www.di.unipi.it/~frangio frangio@di.unipi.it Computational Mathematics for Learning

More information

CONSTRAINED NONLINEAR PROGRAMMING

CONSTRAINED NONLINEAR PROGRAMMING 149 CONSTRAINED NONLINEAR PROGRAMMING We now turn to methods for general constrained nonlinear programming. These may be broadly classified into two categories: 1. TRANSFORMATION METHODS: In this approach

More information

David Hilbert was old and partly deaf in the nineteen thirties. Yet being a diligent

David Hilbert was old and partly deaf in the nineteen thirties. Yet being a diligent Chapter 5 ddddd dddddd dddddddd ddddddd dddddddd ddddddd Hilbert Space The Euclidean norm is special among all norms defined in R n for being induced by the Euclidean inner product (the dot product). A

More information

2.3 Linear Programming

2.3 Linear Programming 2.3 Linear Programming Linear Programming (LP) is the term used to define a wide range of optimization problems in which the objective function is linear in the unknown variables and the constraints are

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

Chapter 10 Conjugate Direction Methods

Chapter 10 Conjugate Direction Methods Chapter 10 Conjugate Direction Methods An Introduction to Optimization Spring, 2012 1 Wei-Ta Chu 2012/4/13 Introduction Conjugate direction methods can be viewed as being intermediate between the method

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information