Lecture 4 - The Gradient Method
Amir Beck, Introduction to Nonlinear Optimization - Lecture Slides

1 Lecture 4 - The Gradient Method
Objective: find an optimal solution of the problem
    $\min\{f(x) : x \in \mathbb{R}^n\}$.
The iterative algorithms that we will consider are of the form
    $x_{k+1} = x_k + t_k d_k, \qquad k = 0, 1, \ldots$
where $d_k$ is the direction and $t_k$ the stepsize. We will limit ourselves to descent directions.
Definition. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuously differentiable function over $\mathbb{R}^n$. A vector $0 \neq d \in \mathbb{R}^n$ is called a descent direction of $f$ at $x$ if the directional derivative $f'(x; d)$ is negative, meaning that
    $f'(x; d) = \nabla f(x)^T d < 0$.

2 The Descent Property of Descent Directions
Lemma: Let $f$ be a continuously differentiable function over $\mathbb{R}^n$, and let $x \in \mathbb{R}^n$. Suppose that $d$ is a descent direction of $f$ at $x$. Then there exists $\varepsilon > 0$ such that
    $f(x + td) < f(x)$ for any $t \in (0, \varepsilon]$.
Proof. Since $f'(x; d) < 0$, it follows from the definition of the directional derivative that
    $\lim_{t \to 0^+} \frac{f(x + td) - f(x)}{t} = f'(x; d) < 0$.
Therefore, there exists $\varepsilon > 0$ such that
    $\frac{f(x + td) - f(x)}{t} < 0$ for any $t \in (0, \varepsilon]$,
which readily implies the desired result.

3 Schematic Descent Direction Method
Initialization: pick $x_0 \in \mathbb{R}^n$ arbitrarily.
General step: for any $k = 0, 1, 2, \ldots$ set
(a) pick a descent direction $d_k$.
(b) find a stepsize $t_k$ satisfying $f(x_k + t_k d_k) < f(x_k)$.
(c) set $x_{k+1} = x_k + t_k d_k$.
(d) if a stopping criterion is satisfied, then STOP and $x_{k+1}$ is the output.
Of course, many details are missing in the above schematic algorithm:
What is the starting point?
How to choose the descent direction?
What stepsize should be taken?
What is the stopping criterion?

4 Stepsize Selection Rules
constant stepsize - $t_k = \bar{t}$ for any $k$.
exact stepsize - $t_k$ is a minimizer of $f$ along the ray $x_k + t d_k$:
    $t_k \in \operatorname{argmin}_{t \geq 0} f(x_k + t d_k)$.
backtracking [1] - the method requires three parameters: $s > 0$, $\alpha \in (0,1)$, $\beta \in (0,1)$. Here we start with an initial stepsize $t_k = s$. While
    $f(x_k) - f(x_k + t_k d_k) < -\alpha t_k \nabla f(x_k)^T d_k$,
set $t_k := \beta t_k$.
Sufficient Decrease Property: $f(x_k) - f(x_k + t_k d_k) \geq -\alpha t_k \nabla f(x_k)^T d_k$.
[1] also referred to as Armijo.
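A minimal MATLAB sketch of the backtracking rule above for a given descent direction; the handles f and gradf and the parameter values are placeholders to be supplied by the user.
function t = backtracking(f, gradf, x, d, s, alpha, beta)
% Backtracking (Armijo) line search along a descent direction d at x.
g = gradf(x);            % gradient at the current point
t = s;                   % start from the initial stepsize s
while f(x) - f(x + t*d) < -alpha * t * (g' * d)
    t = beta * t;        % shrink the stepsize until sufficient decrease holds
end
end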

5 Exact Line Search for Quadratic Functions
    $f(x) = x^T A x + 2 b^T x + c$,
where $A$ is an $n \times n$ positive definite matrix, $b \in \mathbb{R}^n$ and $c \in \mathbb{R}$. Let $x \in \mathbb{R}^n$ and let $d \in \mathbb{R}^n$ be a descent direction of $f$ at $x$. The objective is to find a solution to
    $\min_{t \geq 0} f(x + td)$.
In class.
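A sketch of the in-class computation (my reconstruction, not taken from the slide): writing $g(t) = f(x + td)$ and expanding,
    $g(t) = t^2\, d^T A d + 2t\, d^T(Ax + b) + f(x)$,
so $g'(t) = 2t\, d^T A d + 2\, d^T(Ax + b)$, and setting $g'(t) = 0$ gives
    $t = -\frac{d^T(Ax + b)}{d^T A d} = -\frac{d^T \nabla f(x)}{2\, d^T A d}$,
which is positive (hence feasible for the constraint $t \geq 0$) because $A \succ 0$ and $d$ is a descent direction.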

6 The Gradient Method - Taking the Direction of Minus the Gradient
In the gradient method $d_k = -\nabla f(x_k)$. This is a descent direction as long as $\nabla f(x_k) \neq 0$, since
    $f'(x_k; -\nabla f(x_k)) = -\nabla f(x_k)^T \nabla f(x_k) = -\|\nabla f(x_k)\|^2 < 0$.
In addition to being a descent direction, minus the gradient is also the steepest descent direction.
Lemma: Let $f$ be a continuously differentiable function and let $x \in \mathbb{R}^n$ be a non-stationary point ($\nabla f(x) \neq 0$). Then an optimal solution of
    $\min_d \{ f'(x; d) : \|d\| = 1 \}$    (1)
is $d = -\nabla f(x) / \|\nabla f(x)\|$.
Proof. In class.
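A sketch of the argument (my reconstruction of the in-class proof): by the Cauchy-Schwarz inequality, for any $d$ with $\|d\| = 1$,
    $f'(x; d) = \nabla f(x)^T d \geq -\|\nabla f(x)\|\, \|d\| = -\|\nabla f(x)\|$,
and this lower bound is attained by $d = -\nabla f(x)/\|\nabla f(x)\|$, for which $\nabla f(x)^T d = -\|\nabla f(x)\|$.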

7 The Gradient Method
Input: $\varepsilon > 0$ - tolerance parameter.
Initialization: pick $x_0 \in \mathbb{R}^n$ arbitrarily.
General step: for any $k = 0, 1, 2, \ldots$ execute the following steps:
(a) pick a stepsize $t_k$ by a line search procedure on the function
    $g(t) = f(x_k - t \nabla f(x_k))$.
(b) set $x_{k+1} = x_k - t_k \nabla f(x_k)$.
(c) if $\|\nabla f(x_{k+1})\| \leq \varepsilon$, then STOP and $x_{k+1}$ is the output.
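A minimal MATLAB sketch of the method with a constant stepsize (a backtracking variant would call the line-search routine sketched earlier instead); all handles and parameter values are placeholders.
function x = gradient_method_constant(f, gradf, x0, t, eps)
% Gradient method with a constant stepsize t and tolerance eps.
x = x0;
g = gradf(x);
iter = 0;
while norm(g) > eps
    iter = iter + 1;
    x = x - t * g;                       % x_{k+1} = x_k - t * grad f(x_k)
    g = gradf(x);
    fprintf('iter_number = %3d norm_grad = %2.6f fun_val = %2.6f\n', ...
            iter, norm(g), f(x));
end
end
For instance, the run of the next slides could be set up (under these assumed handle names) as
x = gradient_method_constant(@(x) x(1)^2 + 2*x(2)^2, @(x) [2*x(1); 4*x(2)], [2; 1], 0.1, 1e-5);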

8 Numerical Example
    $\min x^2 + 2y^2$
$x_0 = (2; 1)$, $\varepsilon = 10^{-5}$, exact line search.
13 iterations until convergence.

9 The Zig-Zag Effect
Lemma. Let $\{x_k\}_{k \geq 0}$ be the sequence generated by the gradient method with exact line search for solving a problem of minimizing a continuously differentiable function $f$. Then for any $k = 0, 1, 2, \ldots$
    $(x_{k+2} - x_{k+1})^T (x_{k+1} - x_k) = 0$.
Proof. $x_{k+1} - x_k = -t_k \nabla f(x_k)$ and $x_{k+2} - x_{k+1} = -t_{k+1} \nabla f(x_{k+1})$. Therefore, we need to prove that
    $\nabla f(x_k)^T \nabla f(x_{k+1}) = 0$.
Since $t_k \in \operatorname{argmin}_{t \geq 0} \{ g(t) \equiv f(x_k - t \nabla f(x_k)) \}$, we have $g'(t_k) = 0$, that is,
    $-\nabla f(x_k)^T \nabla f(x_k - t_k \nabla f(x_k)) = 0$,
and hence $\nabla f(x_k)^T \nabla f(x_{k+1}) = 0$.

10 Numerical Example - Constant Stepsize, t = 0.1
    $\min x^2 + 2y^2$
$x_0 = (2; 1)$, $\varepsilon = 10^{-5}$, $t = 0.1$.
The printed iteration log (iter_number, norm_grad, fun_val) runs up to iter_number = 58 before the stopping criterion is met.
Quite a lot of iterations...

11 Numerical Example - Constant Stepsize, t = 10
    $\min x^2 + 2y^2$
$x_0 = (2; 1)$, $\varepsilon = 10^{-5}$, $t = 10$.
The printed iteration log reaches iter_number = 119 with norm_grad = NaN and fun_val = NaN.
The sequence diverges :(
Important question: how can we choose the constant stepsize so that convergence is guaranteed?

12 Lipschitz Continuity of the Gradient
Definition. Let $f$ be a continuously differentiable function over $\mathbb{R}^n$. We say that $f$ has a Lipschitz gradient if there exists $L \geq 0$ for which
    $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for any $x, y \in \mathbb{R}^n$.
$L$ is called the Lipschitz constant.
If $\nabla f$ is Lipschitz with constant $L$, then it is also Lipschitz with constant $\tilde{L}$ for all $\tilde{L} \geq L$.
The class of functions with Lipschitz gradient with constant $L$ is denoted by $C^{1,1}_L(\mathbb{R}^n)$ or just $C^{1,1}_L$.
Linear functions - Given $a \in \mathbb{R}^n$, the function $f(x) = a^T x$ is in $C^{1,1}_0$.
Quadratic functions - Let $A$ be a symmetric $n \times n$ matrix, $b \in \mathbb{R}^n$ and $c \in \mathbb{R}$. Then the function $f(x) = x^T A x + 2 b^T x + c$ is a $C^{1,1}$ function. The smallest Lipschitz constant of $\nabla f$ is $2\|A\|_2$ (why? In class).
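A sketch of the "why" (my reconstruction, under the slide's convention $f(x) = x^T A x + 2 b^T x + c$, so that $\nabla f(x) = 2(Ax + b)$):
    $\|\nabla f(x) - \nabla f(y)\| = 2\|A(x - y)\| \leq 2\|A\|_2\, \|x - y\|$,
and choosing $x - y$ to be an eigenvector of $A$ corresponding to an eigenvalue of maximal absolute value turns the inequality into an equality, so no smaller constant works.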

13 Equivalence to Boundedness of the Hessian
Theorem. Let $f$ be a twice continuously differentiable function over $\mathbb{R}^n$. Then the following two claims are equivalent:
1. $f \in C^{1,1}_L(\mathbb{R}^n)$;
2. $\|\nabla^2 f(x)\| \leq L$ for any $x \in \mathbb{R}^n$.
Proof on pages 73-74 of the book.
Example: $f(x) = \sqrt{1 + x^2} \in C^{1,1}$. In class.
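A sketch of the in-class verification (my own computation): for this one-dimensional example,
    $f'(x) = \frac{x}{\sqrt{1 + x^2}}, \qquad f''(x) = \frac{1}{(1 + x^2)^{3/2}}$,
so $|f''(x)| \leq 1$ for all $x$, and by the theorem $f \in C^{1,1}_1$.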

14 Convergence of the Gradient Method
Theorem. Let $\{x_k\}_{k \geq 0}$ be the sequence generated by GM for solving
    $\min_{x \in \mathbb{R}^n} f(x)$
with one of the following stepsize strategies:
- constant stepsize $\bar{t} \in (0, 2/L)$;
- exact line search;
- backtracking procedure with parameters $s > 0$ and $\alpha, \beta \in (0,1)$.
Assume that
- $f \in C^{1,1}_L(\mathbb{R}^n)$;
- $f$ is bounded below over $\mathbb{R}^n$, that is, there exists $m \in \mathbb{R}$ such that $f(x) > m$ for all $x \in \mathbb{R}^n$.
Then
1. for any $k$, $f(x_{k+1}) < f(x_k)$ unless $\nabla f(x_k) = 0$;
2. $\nabla f(x_k) \to 0$ as $k \to \infty$.
Theorem 4.25 in the book.
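A worked instance of the constant stepsize condition (my own calculation, not on the slide): for $\min x^2 + 2y^2$ we have $f(x) = x^T A x$ with $A = \operatorname{diag}(1, 2)$, so $\nabla^2 f = 2A$ and $L = 2\|A\|_2 = 4$; the theorem therefore guarantees convergence for any constant stepsize in $(0, 2/L) = (0, 0.5)$. This is consistent with the two earlier runs: $t = 0.1$ lies inside this interval and converged (slowly), while $t = 10$ lies far outside it and the iterates diverged.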

15 Two Numerical Examples - Backtracking
    $\min x^2 + 2y^2$
$x_0 = (2; 1)$, $s = 2$, $\alpha = 0.25$, $\beta = 0.5$, $\varepsilon = 10^{-5}$.
The printed log shows only two iterations - fast convergence (also due to luck!); no real advantage to exact line search.
ANOTHER EXAMPLE: $\min 0.01 x^2 + y^2$, $s = 2$, $\alpha = 0.25$, $\beta = 0.5$, $\varepsilon = 10^{-5}$.
The printed log runs up to iter_number = 201.
Important Question: Can we detect key properties of the objective function that imply slow/fast convergence?

16 Kantorovich Inequality
Lemma. Let $A$ be a positive definite $n \times n$ matrix. Then for any $0 \neq x \in \mathbb{R}^n$ the inequality
    $\frac{(x^T x)^2}{(x^T A x)(x^T A^{-1} x)} \geq \frac{4 \lambda_{\max}(A) \lambda_{\min}(A)}{(\lambda_{\max}(A) + \lambda_{\min}(A))^2}$
holds.
Proof. Denote $m = \lambda_{\min}(A)$ and $M = \lambda_{\max}(A)$. The eigenvalues of the matrix $A + MmA^{-1}$ are $\lambda_i(A) + \frac{Mm}{\lambda_i(A)}$. The maximum of the one-dimensional function $\varphi(t) = t + \frac{Mm}{t}$ over $[m, M]$ is attained at the endpoints $m$ and $M$, with corresponding value $M + m$. Thus, the eigenvalues of $A + MmA^{-1}$ are smaller than or equal to $M + m$, so
    $A + MmA^{-1} \preceq (M + m)I$,
and hence
    $x^T A x + Mm(x^T A^{-1} x) \leq (M + m)(x^T x)$.
Therefore,
    $(x^T A x)\left[Mm(x^T A^{-1} x)\right] \leq \frac{1}{4}\left[(x^T A x) + Mm(x^T A^{-1} x)\right]^2 \leq \frac{(M + m)^2 (x^T x)^2}{4}$,
which after rearranging gives the desired inequality.

17 Gradient Method for Minimizing $x^T A x$
Theorem. Let $\{x_k\}_{k \geq 0}$ be the sequence generated by the gradient method with exact line search for solving the problem
    $\min_{x \in \mathbb{R}^n} x^T A x \qquad (A \succ 0)$.
Then for any $k = 0, 1, \ldots$:
    $f(x_{k+1}) \leq \left(\frac{M - m}{M + m}\right)^2 f(x_k)$,
where $M = \lambda_{\max}(A)$, $m = \lambda_{\min}(A)$.
Proof. $x_{k+1} = x_k - t_k d_k$, where $t_k = \frac{d_k^T d_k}{2\, d_k^T A d_k}$ and $d_k = \nabla f(x_k) = 2 A x_k$.

18 Proof of Rate of Convergence Contd.
    $f(x_{k+1}) = x_{k+1}^T A x_{k+1} = (x_k - t_k d_k)^T A (x_k - t_k d_k)$
    $\qquad = x_k^T A x_k - 2 t_k\, d_k^T A x_k + t_k^2\, d_k^T A d_k = x_k^T A x_k - t_k\, d_k^T d_k + t_k^2\, d_k^T A d_k.$
Plugging in the expression for $t_k$:
    $f(x_{k+1}) = x_k^T A x_k - \frac{1}{4} \frac{(d_k^T d_k)^2}{d_k^T A d_k} = x_k^T A x_k \left(1 - \frac{(d_k^T d_k)^2}{4\, (d_k^T A d_k)(x_k^T A A^{-1} A x_k)}\right) = \left(1 - \frac{(d_k^T d_k)^2}{(d_k^T A d_k)(d_k^T A^{-1} d_k)}\right) f(x_k).$
By Kantorovich:
    $f(x_{k+1}) \leq \left(1 - \frac{4 M m}{(M + m)^2}\right) f(x_k) = \left(\frac{M - m}{M + m}\right)^2 f(x_k) = \left(\frac{\kappa(A) - 1}{\kappa(A) + 1}\right)^2 f(x_k).$

19 The Condition Number
Definition. Let $A$ be an $n \times n$ positive definite matrix. Then the condition number of $A$ is defined by
    $\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$.
Matrices (or quadratic functions) with large condition number are called ill-conditioned.
Matrices with small condition number are called well-conditioned.
A large condition number implies a large number of iterations of the gradient method.
A small condition number implies a small number of iterations of the gradient method.
For a non-quadratic function, the asymptotic rate of convergence of $x_k$ to a stationary point $x^*$ is usually determined by the condition number of $\nabla^2 f(x^*)$.

20 A Severely Ill-Conditioned Function - Rosenbrock
    $\min \left\{ f(x_1, x_2) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2 \right\}$.
Optimal solution: $(x_1, x_2) = (1, 1)$, optimal value: 0.
    $\nabla f(x) = \begin{pmatrix} -400 x_1 (x_2 - x_1^2) - 2(1 - x_1) \\ 200 (x_2 - x_1^2) \end{pmatrix}$,
    $\nabla^2 f(x) = \begin{pmatrix} -400 x_2 + 1200 x_1^2 + 2 & -400 x_1 \\ -400 x_1 & 200 \end{pmatrix}$.
Condition number: $\nabla^2 f(1, 1) = \begin{pmatrix} 802 & -400 \\ -400 & 200 \end{pmatrix}$.
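A quick MATLAB check of how ill-conditioned the problem is near the solution (the numerical value is my own computation, not taken from the slide):
H = [802 -400; -400 200];   % Hessian of the Rosenbrock function at (1,1)
cond(H)                      % approximately 2508 - severely ill-conditioned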

21 Solution of the Rosenbrock Problem with the Gradient Method
$x_0 = (2; 5)$, $s = 2$, $\alpha = 0.25$, $\beta = 0.5$, $\varepsilon = 10^{-5}$, backtracking stepsize selection.
6890(!!!) iterations.

22 Sensitivity of Solutions to Linear Systems
Suppose that we are given the linear system $Ax = b$, where $A \succ 0$, and we assume that $x$ is indeed the solution of the system ($x = A^{-1} b$). Suppose that the right-hand side is perturbed to $b + \Delta b$. What can be said about the solution $x + \Delta x$ of the new system?
    $\Delta x = A^{-1} \Delta b$.
Result (derivation in class):
    $\frac{\|\Delta x\|}{\|x\|} \leq \kappa(A) \frac{\|\Delta b\|}{\|b\|}$.
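A sketch of the in-class derivation (my reconstruction, using a norm for which $\|A\| \|A^{-1}\| = \kappa(A)$, e.g. the spectral norm for $A \succ 0$): subtracting $Ax = b$ from $A(x + \Delta x) = b + \Delta b$ gives $\Delta x = A^{-1} \Delta b$, so $\|\Delta x\| \leq \|A^{-1}\| \|\Delta b\|$. Also $\|b\| = \|Ax\| \leq \|A\| \|x\|$, i.e. $1/\|x\| \leq \|A\|/\|b\|$. Multiplying the two bounds yields
    $\frac{\|\Delta x\|}{\|x\|} \leq \|A\| \|A^{-1}\| \frac{\|\Delta b\|}{\|b\|} = \kappa(A) \frac{\|\Delta b\|}{\|b\|}$.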

23 Numerical Example
Consider the ill-conditioned matrix
    $A = \begin{pmatrix} 1 + 10^{-5} & 1 \\ 1 & 1 + 10^{-5} \end{pmatrix}$.
>> A=[1+1e-5,1;1,1+1e-5];
>> cond(A)
ans =
   2.0000e+005
We have
>> A\[1;1]
ans =
    0.5000
    0.5000
However,
>> A\[1.1;1]
ans =
   1.0e+003 *
    5.0005
   -4.9995

24 Scaled Gradient Method
Consider the minimization problem
    (P)  $\min\{f(x) : x \in \mathbb{R}^n\}$.
For a given nonsingular matrix $S \in \mathbb{R}^{n \times n}$, we make the linear change of variables $x = Sy$ and obtain the equivalent problem
    (P')  $\min\{g(y) \equiv f(Sy) : y \in \mathbb{R}^n\}$.
Since $\nabla g(y) = S^T \nabla f(Sy) = S^T \nabla f(x)$, the gradient method for (P') is
    $y_{k+1} = y_k - t_k S^T \nabla f(S y_k)$.
Multiplying the latter equality by $S$ from the left, and using the notation $x_k = S y_k$:
    $x_{k+1} = x_k - t_k S S^T \nabla f(x_k)$.
Defining $D = S S^T$, we obtain the scaled gradient method:
    $x_{k+1} = x_k - t_k D \nabla f(x_k)$.

25 Scaled Gradient Method
Since $D \succ 0$, the direction $-D \nabla f(x_k)$ is a descent direction:
    $f'(x_k; -D \nabla f(x_k)) = -\nabla f(x_k)^T D \nabla f(x_k) < 0$.
We also allow different scaling matrices at each iteration.
Scaled Gradient Method
Input: $\varepsilon > 0$ - tolerance parameter.
Initialization: pick $x_0 \in \mathbb{R}^n$ arbitrarily.
General step: for any $k = 0, 1, 2, \ldots$ execute the following steps:
(a) pick a scaling matrix $D_k \succ 0$.
(b) pick a stepsize $t_k$ by a line search procedure on the function
    $g(t) = f(x_k - t D_k \nabla f(x_k))$.
(c) set $x_{k+1} = x_k - t_k D_k \nabla f(x_k)$.
(d) if $\|\nabla f(x_{k+1})\| \leq \varepsilon$, then STOP and $x_{k+1}$ is the output.

26 Choosing the Scaling Matrix $D_k$
The scaled gradient method with scaling matrix $D$ is equivalent to the gradient method employed on the function $g(y) = f(D^{1/2} y)$. Note that the gradient and Hessian of $g$ are given by
    $\nabla g(y) = D^{1/2} \nabla f(D^{1/2} y) = D^{1/2} \nabla f(x)$,
    $\nabla^2 g(y) = D^{1/2} \nabla^2 f(D^{1/2} y) D^{1/2} = D^{1/2} \nabla^2 f(x) D^{1/2}$.
The objective is usually to pick $D_k$ so as to make $D_k^{1/2} \nabla^2 f(x_k) D_k^{1/2}$ as well-conditioned as possible.
A well known choice (Newton's method): $D_k = (\nabla^2 f(x_k))^{-1}$.
Diagonal scaling: $D_k$ is picked to be diagonal. For example,
    $(D_k)_{ii} = \left(\frac{\partial^2 f(x_k)}{\partial x_i^2}\right)^{-1}$.
Diagonal scaling can be very effective when the decision variables are of different magnitudes.
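A minimal MATLAB sketch of diagonal scaling, applied to the badly scaled quadratic $0.01 x_1^2 + x_2^2$ from slide 15 (the handles, starting point and stepsize are my own toy choices, not from the slides):
% Scaled gradient iterations with (D_k)_ii = (d^2 f / dx_i^2)^(-1).
f     = @(x) 0.01*x(1)^2 + x(2)^2;
gradf = @(x) [0.02*x(1); 2*x(2)];
hdiag = @(x) [0.02; 2];                 % diagonal of the (constant) Hessian
x = [2; 1];
for k = 1:100
    g = gradf(x);
    if norm(g) <= 1e-5, break; end
    Dk = diag(1 ./ hdiag(x));           % diagonal scaling matrix, Dk > 0
    x  = x - 1 * Dk * g;                % t_k = 1 is the exact line-search stepsize here
end
For this separable quadratic the scaled direction is exactly the Newton direction, so the iteration reaches the minimizer in a single step, in contrast to the 201 backtracking iterations reported on slide 15.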

27 The Gauss-Newton Method
Nonlinear least squares problem:
    (NLS)  $\min_{x \in \mathbb{R}^n} \left\{ g(x) \equiv \sum_{i=1}^m (f_i(x) - c_i)^2 \right\}$.
$f_1, \ldots, f_m$ are continuously differentiable over $\mathbb{R}^n$ and $c_1, \ldots, c_m \in \mathbb{R}$.
Denote
    $F(x) = \begin{pmatrix} f_1(x) - c_1 \\ f_2(x) - c_2 \\ \vdots \\ f_m(x) - c_m \end{pmatrix}$.
Then the problem becomes
    $\min \|F(x)\|^2$.

28 The Gauss-Newton Method
Given the $k$th iterate $x_k$, the next iterate is chosen to minimize the sum of squares of the linearized terms, that is,
    $x_{k+1} = \operatorname{argmin}_{x \in \mathbb{R}^n} \left\{ \sum_{i=1}^m \left[ f_i(x_k) + \nabla f_i(x_k)^T (x - x_k) - c_i \right]^2 \right\}$.
The general step actually consists of solving the linear LS problem
    $\min \|A_k x - b_k\|^2$,
where
    $A_k = \begin{pmatrix} \nabla f_1(x_k)^T \\ \nabla f_2(x_k)^T \\ \vdots \\ \nabla f_m(x_k)^T \end{pmatrix} = J(x_k)$
is the so-called Jacobian matrix, assumed to have full column rank, and
    $b_k = \begin{pmatrix} \nabla f_1(x_k)^T x_k - f_1(x_k) + c_1 \\ \nabla f_2(x_k)^T x_k - f_2(x_k) + c_2 \\ \vdots \\ \nabla f_m(x_k)^T x_k - f_m(x_k) + c_m \end{pmatrix} = J(x_k) x_k - F(x_k)$.

29 The Gauss-Newton Method
The Gauss-Newton method can thus be written as
    $x_{k+1} = (J(x_k)^T J(x_k))^{-1} J(x_k)^T b_k$.
The gradient of the objective function $f(x) = \|F(x)\|^2$ is
    $\nabla f(x) = 2 J(x)^T F(x)$.
The GN method can be rewritten as follows:
    $x_{k+1} = (J(x_k)^T J(x_k))^{-1} J(x_k)^T (J(x_k) x_k - F(x_k))$
    $\qquad = x_k - (J(x_k)^T J(x_k))^{-1} J(x_k)^T F(x_k)$
    $\qquad = x_k - \frac{1}{2} (J(x_k)^T J(x_k))^{-1} \nabla f(x_k)$,
that is, it is a scaled gradient method with a special choice of scaling matrix:
    $D_k = \frac{1}{2} (J(x_k)^T J(x_k))^{-1}$.

30 The Damped Gauss-Newton Method
The Gauss-Newton method does not incorporate a stepsize, which might cause it to diverge. A well known variation of the method incorporating stepsizes is the damped Gauss-Newton method.
Damped Gauss-Newton Method
Input: $\varepsilon > 0$ - tolerance parameter.
Initialization: pick $x_0 \in \mathbb{R}^n$ arbitrarily.
General step: for any $k = 0, 1, 2, \ldots$ execute the following steps:
(a) set $d_k = -(J(x_k)^T J(x_k))^{-1} J(x_k)^T F(x_k)$.
(b) set $t_k$ by a line search procedure on the function
    $h(t) = g(x_k + t d_k)$.
(c) set $x_{k+1} = x_k + t_k d_k$.
(d) if $\|\nabla g(x_{k+1})\| \leq \varepsilon$, then STOP and $x_{k+1}$ is the output.
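A minimal MATLAB sketch of the damped Gauss-Newton method with a backtracking stepsize; F returns the residual vector $(f_i(x) - c_i)_i$, J its Jacobian, and all handles and parameters are placeholders.
function x = damped_gauss_newton(F, J, x0, s, alpha, beta, eps)
% Damped Gauss-Newton method for min ||F(x)||^2.
x = x0;
g = @(y) norm(F(y))^2;                       % objective g(x) = ||F(x)||^2
grad = 2 * J(x)' * F(x);                     % gradient of g at x
while norm(grad) > eps
    d = -(J(x)' * J(x)) \ (J(x)' * F(x));    % Gauss-Newton direction
    t = s;                                   % backtracking on h(t) = g(x + t*d)
    while g(x) - g(x + t*d) < -alpha * t * (grad' * d)
        t = beta * t;
    end
    x = x + t * d;
    grad = 2 * J(x)' * F(x);
end
end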

31 Fermat-Weber Problem
Fermat-Weber Problem: Given $m$ points $a_1, \ldots, a_m$ in $\mathbb{R}^n$ - also called anchor points - and $m$ weights $\omega_1, \omega_2, \ldots, \omega_m > 0$, find a point $x \in \mathbb{R}^n$ that minimizes the weighted sum of distances from $x$ to the points $a_1, \ldots, a_m$:
    $\min_{x \in \mathbb{R}^n} \left\{ f(x) \equiv \sum_{i=1}^m \omega_i \|x - a_i\| \right\}$.
The objective function is not differentiable at the anchor points $a_1, \ldots, a_m$.
One of the simplest instances of facility location problems.

32 Weiszfeld's Method (1937)
Start from the stationarity condition $\nabla f(x) = 0$: [2]
    $\nabla f(x) = \sum_{i=1}^m \omega_i \frac{x - a_i}{\|x - a_i\|} = 0$,
that is,
    $\left( \sum_{i=1}^m \frac{\omega_i}{\|x - a_i\|} \right) x = \sum_{i=1}^m \frac{\omega_i a_i}{\|x - a_i\|}$,
so
    $x = \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x - a_i\|}} \sum_{i=1}^m \frac{\omega_i a_i}{\|x - a_i\|}$.
The stationarity condition can be written as $x = T(x)$, where $T$ is the operator
    $T(x) \equiv \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x - a_i\|}} \sum_{i=1}^m \frac{\omega_i a_i}{\|x - a_i\|}$.
Weiszfeld's method is a fixed point method:
    $x_{k+1} = T(x_k)$.
[2] We implicitly assume here that $x$ is not an anchor point.

33 Weiszfeld's Method as a Gradient Method
Weiszfeld's Method
Initialization: pick $x_0 \in \mathbb{R}^n$ such that $x_0 \neq a_1, a_2, \ldots, a_m$.
General step: for any $k = 0, 1, 2, \ldots$ compute
    $x_{k+1} = T(x_k) = \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}} \sum_{i=1}^m \frac{\omega_i a_i}{\|x_k - a_i\|}$.
Weiszfeld's method is a gradient method, since
    $x_{k+1} = \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}} \sum_{i=1}^m \frac{\omega_i a_i}{\|x_k - a_i\|} = x_k - \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}} \sum_{i=1}^m \omega_i \frac{x_k - a_i}{\|x_k - a_i\|} = x_k - \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}} \nabla f(x_k)$,
that is, a gradient method with a special choice of stepsize:
    $t_k = \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}}$.
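A minimal MATLAB sketch of Weiszfeld's method; the anchor matrix a (one anchor per column), the weights w, the starting point and the iteration budget are placeholders, no safeguard is included for iterates that land on an anchor, and implicit array expansion (MATLAB R2016b or later) is assumed.
function x = weiszfeld(a, w, x0, maxiter)
% Weiszfeld's fixed point method for the Fermat-Weber problem.
% a: n-by-m matrix of anchors (one per column), w: m weights, x0: start.
x = x0;
for k = 1:maxiter
    dist = sqrt(sum((a - x).^2, 1));      % ||x - a_i|| for each anchor
    coef = w(:)' ./ dist;                 % w_i / ||x - a_i||
    x = (a * coef') / sum(coef);          % x_{k+1} = T(x_k)
end
end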
