Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem


Lecture 4 - The Gradient Method

Objective: find an optimal solution of the problem min{f(x) : x ∈ R^n}. The iterative algorithms that we will consider are of the form

x_{k+1} = x_k + t_k d_k, k = 0, 1, ...

d_k - direction, t_k - stepsize. We will limit ourselves to descent directions.

Definition. Let f : R^n → R be a continuously differentiable function over R^n. A vector 0 ≠ d ∈ R^n is called a descent direction of f at x if the directional derivative f'(x; d) is negative, meaning that

f'(x; d) = ∇f(x)^T d < 0.

[Amir Beck, Introduction to Nonlinear Optimization, Lecture Slides - The Gradient Method]

The Descent Property of Descent Directions

Lemma: Let f be a continuously differentiable function over R^n, and let x ∈ R^n. Suppose that d is a descent direction of f at x. Then there exists ε > 0 such that

f(x + td) < f(x) for any t ∈ (0, ε].

Proof. Since f'(x; d) < 0, it follows from the definition of the directional derivative that

lim_{t → 0+} [f(x + td) − f(x)] / t = f'(x; d) < 0.

Therefore, there exists ε > 0 such that

[f(x + td) − f(x)] / t < 0 for any t ∈ (0, ε],

which readily implies the desired result.

Schematic Descent Direction Method

Initialization: pick x_0 ∈ R^n arbitrarily.
General step: for any k = 0, 1, 2, ...
(a) pick a descent direction d_k.
(b) find a stepsize t_k satisfying f(x_k + t_k d_k) < f(x_k).
(c) set x_{k+1} = x_k + t_k d_k.
(d) if a stopping criterion is satisfied, then STOP and x_{k+1} is the output.

Of course, many details are missing in the above schematic algorithm: What is the starting point? How to choose the descent direction? What stepsize should be taken? What is the stopping criterion?

Stepsize Selection Rules

- constant stepsize: t_k = t̄ for any k.
- exact stepsize: t_k is a minimizer of f along the ray x_k + t d_k:

  t_k ∈ argmin_{t ≥ 0} f(x_k + t d_k).

- backtracking (also referred to as Armijo): the method requires three parameters: s > 0, α ∈ (0, 1), β ∈ (0, 1). We start with an initial stepsize t_k = s. While

  f(x_k) − f(x_k + t_k d_k) < −α t_k ∇f(x_k)^T d_k,

  set t_k := β t_k.

Sufficient decrease property: f(x_k) − f(x_k + t_k d_k) ≥ −α t_k ∇f(x_k)^T d_k.
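The backtracking loop above can be sketched in a few lines. This is a minimal illustration, not the book's code; the function and gradient are passed in as plain Python callables on lists.

```python
# Minimal sketch of the backtracking (Armijo) rule: shrink t by the
# factor beta until the sufficient decrease property holds.
def backtracking(f, grad, x, d, s=1.0, alpha=0.25, beta=0.5):
    t = s
    gTd = sum(g * di for g, di in zip(grad(x), d))  # grad f(x)^T d < 0
    # while f(x) - f(x + t d) < -alpha * t * grad f(x)^T d, set t := beta * t
    while f(x) - f([xi + t * di for xi, di in zip(x, d)]) < -alpha * t * gTd:
        t *= beta
    return t
```

For instance, on f(x, y) = x² + 2y² at (2, 1) with d = −∇f = (−4, −4) and s = 1, the initial stepsize is rejected and t = 0.5 is accepted.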

Exact Line Search for Quadratic Functions

f(x) = x^T Ax + 2b^T x + c, where A is an n × n positive definite matrix, b ∈ R^n and c ∈ R. Let x ∈ R^n and let d ∈ R^n be a descent direction of f at x. The objective is to find a solution to

min_{t ≥ 0} f(x + td).

In class.
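A sketch of the calculation done in class: g(t) = f(x + td) is a quadratic in t with g'(t) = d^T ∇f(x) + 2t d^T Ad and ∇f(x) = 2(Ax + b), so the exact stepsize is t* = −d^T ∇f(x) / (2 d^T Ad), which is positive since d is a descent direction. A small numeric check (with A, b, x, d as plain lists):

```python
# Exact stepsize for the quadratic f(x) = x^T A x + 2 b^T x + c:
# t* = -d^T grad f(x) / (2 d^T A d), with grad f(x) = 2(Ax + b).
def exact_stepsize(A, b, x, d):
    n = len(x)
    grad = [2 * (sum(A[i][j] * x[j] for j in range(n)) + b[i])
            for i in range(n)]
    dTg = sum(di * gi for di, gi in zip(d, grad))
    dTAd = sum(d[i] * A[i][j] * d[j] for i in range(n) for j in range(n))
    return -dTg / (2 * dTAd)
```

For A = diag(1, 2), b = 0, x = (2, 1) and d = −∇f(x) = (−4, −4) this gives t* = 1/3.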

The Gradient Method - Taking the Direction of Minus the Gradient

In the gradient method d_k = −∇f(x_k). This is a descent direction as long as ∇f(x_k) ≠ 0 since

f'(x_k; −∇f(x_k)) = −∇f(x_k)^T ∇f(x_k) = −‖∇f(x_k)‖² < 0.

In addition to being a descent direction, minus the gradient is also the steepest descent direction.

Lemma: Let f be a continuously differentiable function and let x ∈ R^n be a non-stationary point (∇f(x) ≠ 0). Then an optimal solution of

min_d {f'(x; d) : ‖d‖ = 1}    (1)

is d = −∇f(x)/‖∇f(x)‖.

Proof. In class.

The Gradient Method

Input: ε > 0 - tolerance parameter.
Initialization: pick x_0 ∈ R^n arbitrarily.
General step: for any k = 0, 1, 2, ... execute the following steps:
(a) pick a stepsize t_k by a line search procedure on the function g(t) = f(x_k − t∇f(x_k)).
(b) set x_{k+1} = x_k − t_k ∇f(x_k).
(c) if ‖∇f(x_{k+1})‖ ≤ ε, then STOP and x_{k+1} is the output.

Numerical Example

min x² + 2y², x_0 = (2; 1), ε = 10⁻⁵, exact line search. 13 iterations until convergence.
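This run can be reproduced with a short script. For the quadratic f(x, y) = x² + 2y² = x^T Ax with A = diag(1, 2), the exact stepsize along d_k = ∇f(x_k) has the closed form t_k = d_k^T d_k / (2 d_k^T Ad_k):

```python
# Gradient method with exact line search on f(x, y) = x^2 + 2y^2,
# using the closed-form stepsize for a quadratic x^T A x, A = diag(1, 2).
def gm_exact(x, y, eps=1e-5, max_iter=100):
    k = 0
    while (4 * x * x + 16 * y * y) ** 0.5 > eps and k < max_iter:
        gx, gy = 2 * x, 4 * y                                # gradient
        t = (gx * gx + gy * gy) / (2 * (gx * gx + 2 * gy * gy))
        x, y = x - t * gx, y - t * gy
        k += 1
    return x, y, k
```

Starting from (2, 1) with ε = 10⁻⁵ this stops after exactly the 13 iterations reported on the slide.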

The Zig-Zag Effect

Lemma. Let {x_k}_{k≥0} be the sequence generated by the gradient method with exact line search for solving a problem of minimizing a continuously differentiable function f. Then for any k = 0, 1, 2, ...

(x_{k+2} − x_{k+1})^T (x_{k+1} − x_k) = 0.

Proof. x_{k+1} − x_k = −t_k ∇f(x_k), x_{k+2} − x_{k+1} = −t_{k+1} ∇f(x_{k+1}). Therefore, we need to prove that ∇f(x_k)^T ∇f(x_{k+1}) = 0. Since

t_k ∈ argmin_{t ≥ 0} {g(t) ≡ f(x_k − t∇f(x_k))},

we have g'(t_k) = 0, that is,

−∇f(x_k)^T ∇f(x_k − t_k ∇f(x_k)) = 0,

and hence ∇f(x_k)^T ∇f(x_{k+1}) = 0.

Numerical Example - Constant Stepsize, t = 0.1

min x² + 2y², x_0 = (2; 1), ε = 10⁻⁵, t = 0.1.

iter_number = 1  norm_grad = 4.000000  fun_val = 3.280000
iter_number = 2  norm_grad = 2.937210  fun_val = 1.897600
iter_number = 3  norm_grad = 2.222791  fun_val = 1.141888
   :                  :                     :
iter_number = 56 norm_grad = 0.000015  fun_val = 0.000000
iter_number = 57 norm_grad = 0.000012  fun_val = 0.000000
iter_number = 58 norm_grad = 0.000010  fun_val = 0.000000

quite a lot of iterations...
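The table above is easy to reproduce: with t = 0.1 each step is x ← 0.8x, y ← 0.6y, and the printed values are ‖∇f(x_{k})‖ and f(x_{k}) after the update.

```python
# Constant-stepsize gradient method on f(x, y) = x^2 + 2y^2 from
# x_0 = (2, 1); returns the (iter, norm_grad, fun_val) log rows.
def gm_constant(x, y, t=0.1, eps=1e-5, max_iter=1000):
    log = []
    for k in range(1, max_iter + 1):
        x, y = x - t * 2 * x, y - t * 4 * y          # x_{k+1} = x_k - t grad f
        norm_grad = (4 * x * x + 16 * y * y) ** 0.5
        log.append((k, norm_grad, x * x + 2 * y * y))
        if norm_grad <= eps:
            break
    return log
```

The first row matches the slide (norm_grad = 4.0, fun_val = 3.28) and the run terminates at iteration 58.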

Numerical Example - Constant Stepsize, t = 10

min x² + 2y², x_0 = (2; 1), ε = 10⁻⁵, t = 10.

iter_number = 1   norm_grad = 1783.488716       fun_val = 476806.000000
iter_number = 2   norm_grad = 656209.693339     fun_val = 56962873606.00
iter_number = 3   norm_grad = 256032703.004797  fun_val = 83183008071
   :                  :                              :
iter_number = 119 norm_grad = NaN               fun_val = NaN

The sequence diverges :( Important question: how can we choose the constant stepsize so that convergence is guaranteed?

Lipschitz Continuity of the Gradient

Definition. Let f be a continuously differentiable function over R^n. We say that f has a Lipschitz gradient if there exists L ≥ 0 for which

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for any x, y ∈ R^n.

L is called the Lipschitz constant.

- If ∇f is Lipschitz with constant L, then it is also Lipschitz with any constant L̃ ≥ L.
- The class of functions with Lipschitz gradient with constant L is denoted by C^{1,1}_L(R^n) or just C^{1,1}_L.
- Linear functions: given a ∈ R^n, the function f(x) = a^T x is in C^{1,1}_0.
- Quadratic functions: let A be a symmetric n × n matrix, b ∈ R^n and c ∈ R. Then the function f(x) = x^T Ax + 2b^T x + c is a C^{1,1} function. The smallest Lipschitz constant of ∇f is 2‖A‖₂. Why? In class.

Equivalence to Boundedness of the Hessian

Theorem. Let f be a twice continuously differentiable function over R^n. Then the following two claims are equivalent:
1. f ∈ C^{1,1}_L(R^n).
2. ‖∇²f(x)‖ ≤ L for any x ∈ R^n.

Proof on pages 73-74 of the book.

Example: f(x) = √(1 + x²) ∈ C^{1,1}_1. In class.
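A quick numeric sanity check of the example: f(x) = √(1 + x²) has f''(x) = (1 + x²)^{-3/2} ≤ 1, so by the theorem f' should be 1-Lipschitz. The snippet samples point pairs and verifies the Lipschitz bound:

```python
import math

# f(x) = sqrt(1 + x^2): its derivative is f'(x) = x / sqrt(1 + x^2),
# and |f''(x)| = (1 + x^2)^(-3/2) <= 1, so |f'(u) - f'(v)| <= |u - v|.
fp = lambda x: x / math.sqrt(1 + x * x)

pts = [i / 10 for i in range(-50, 51)]
ok = all(abs(fp(u) - fp(v)) <= abs(u - v) + 1e-12
         for u in pts for v in pts)
```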

Convergence of the Gradient Method

Theorem. Let {x_k}_{k≥0} be the sequence generated by GM for solving min_{x ∈ R^n} f(x) with one of the following stepsize strategies:
- constant stepsize t ∈ (0, 2/L).
- exact line search.
- backtracking procedure with parameters s > 0 and α, β ∈ (0, 1).

Assume that
- f ∈ C^{1,1}_L(R^n),
- f is bounded below over R^n, that is, there exists m ∈ R such that f(x) > m for all x ∈ R^n.

Then
1. for any k, f(x_{k+1}) < f(x_k) unless ∇f(x_k) = 0.
2. ∇f(x_k) → 0 as k → ∞.

Theorem 4.25 in the book.

Two Numerical Examples - Backtracking

min x² + 2y², x_0 = (2; 1), s = 2, α = 0.25, β = 0.5, ε = 10⁻⁵.

iter_number = 1 norm_grad = 2.000000 fun_val = 1.000000
iter_number = 2 norm_grad = 0.000000 fun_val = 0.000000

fast convergence (also due to luck!) - no real advantage to exact line search.

ANOTHER EXAMPLE: min 0.01x² + y², s = 2, α = 0.25, β = 0.5, ε = 10⁻⁵.

iter_number = 1   norm_grad = 0.028003 fun_val = 0.009704
iter_number = 2   norm_grad = 0.027730 fun_val = 0.009324
iter_number = 3   norm_grad = 0.027465 fun_val = 0.008958
   :                  :                     :
iter_number = 201 norm_grad = 0.000010 fun_val = 0.000000

Important question: can we detect key properties of the objective function that imply slow/fast convergence?

Kantorovich Inequality

Lemma. Let A be a positive definite n × n matrix. Then for any 0 ≠ x ∈ R^n the inequality

(x^T x)² / [(x^T Ax)(x^T A⁻¹x)] ≥ 4λ_max(A)λ_min(A) / (λ_max(A) + λ_min(A))²

holds.

Proof. Denote m = λ_min(A) and M = λ_max(A). The eigenvalues of the matrix A + MmA⁻¹ are λ_i(A) + Mm/λ_i(A). The maximum of the 1-D function ϕ(t) = t + Mm/t over [m, M] is attained at the endpoints m and M with a corresponding value of M + m. Thus, the eigenvalues of A + MmA⁻¹ are no larger than M + m:

A + MmA⁻¹ ⪯ (M + m)I,

and hence

x^T Ax + Mm(x^T A⁻¹x) ≤ (M + m)(x^T x).

Therefore, by the arithmetic-geometric mean inequality,

(x^T Ax)[Mm(x^T A⁻¹x)] ≤ (1/4)[(x^T Ax) + Mm(x^T A⁻¹x)]² ≤ (1/4)(M + m)²(x^T x)²,

which after rearranging gives the desired inequality.
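The inequality is easy to check numerically. Here is a small verification on the (arbitrarily chosen) 2 × 2 positive definite matrix A = [[2, 1], [1, 3]] and a handful of nonzero vectors:

```python
import math

# Check (x^T x)^2 / ((x^T A x)(x^T A^-1 x)) >= 4Mm / (M + m)^2
# for A = [[2, 1], [1, 3]] (positive definite) and several x.
A = [[2.0, 1.0], [1.0, 3.0]]
detA = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[A[1][1] / detA, -A[0][1] / detA],
        [-A[1][0] / detA, A[0][0] / detA]]

def qf(M, x):  # quadratic form x^T M x
    return sum(x[i] * M[i][j] * x[j] for i in range(2) for j in range(2))

tr = A[0][0] + A[1][1]
disc = math.sqrt(tr * tr - 4 * detA)
lam_min, lam_max = (tr - disc) / 2, (tr + disc) / 2   # eigenvalues of A
bound = 4 * lam_max * lam_min / (lam_max + lam_min) ** 2

def ratio(x):
    xx = x[0] * x[0] + x[1] * x[1]
    return xx * xx / (qf(A, x) * qf(Ainv, x))

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, -2.0], [0.5, 4.0]]
ok = all(ratio(x) >= bound - 1e-12 for x in xs)
```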

Gradient Method for Minimizing x^T Ax

Theorem. Let {x_k}_{k≥0} be the sequence generated by the gradient method with exact line search for solving the problem

min_{x ∈ R^n} x^T Ax   (A ≻ 0).

Then for any k = 0, 1, ...:

f(x_{k+1}) ≤ ((M − m)/(M + m))² f(x_k),

where M = λ_max(A), m = λ_min(A).

Proof. x_{k+1} = x_k − t_k d_k, where t_k = (d_k^T d_k)/(2 d_k^T Ad_k), d_k = 2Ax_k.

Proof of Rate of Convergence Contd.

f(x_{k+1}) = x_{k+1}^T Ax_{k+1} = (x_k − t_k d_k)^T A(x_k − t_k d_k)
           = x_k^T Ax_k − 2t_k d_k^T Ax_k + t_k² d_k^T Ad_k
           = x_k^T Ax_k − t_k d_k^T d_k + t_k² d_k^T Ad_k.

Plugging in the expression for t_k:

f(x_{k+1}) = x_k^T Ax_k − (1/4)(d_k^T d_k)²/(d_k^T Ad_k)
           = x_k^T Ax_k [1 − (1/4)(d_k^T d_k)² / ((d_k^T Ad_k)(x_k^T AA⁻¹Ax_k))]
           = [1 − (d_k^T d_k)² / ((d_k^T Ad_k)(d_k^T A⁻¹d_k))] f(x_k).

By Kantorovich:

f(x_{k+1}) ≤ (1 − 4Mm/(M + m)²) f(x_k) = ((M − m)/(M + m))² f(x_k) = ((κ(A) − 1)/(κ(A) + 1))² f(x_k),

where κ(A) = M/m.

The Condition Number

Definition. Let A be an n × n positive definite matrix. Then the condition number of A is defined by

κ(A) = λ_max(A)/λ_min(A).

- Matrices (or quadratic functions) with a large condition number are called ill-conditioned.
- Matrices with a small condition number are called well-conditioned.
- A large condition number implies a large number of iterations of the gradient method; a small condition number implies a small number of iterations.
- For a non-quadratic function, the asymptotic rate of convergence of x_k to a stationary point x* is usually determined by the condition number of ∇²f(x*).

A Severely Ill-Conditioned Function - Rosenbrock

min { f(x_1, x_2) = 100(x_2 − x_1²)² + (1 − x_1)² }.

Optimal solution: (x_1, x_2) = (1, 1), optimal value: 0.

∇f(x) = ( −400x_1(x_2 − x_1²) − 2(1 − x_1), 200(x_2 − x_1²) )^T,

∇²f(x) = [ −400x_2 + 1200x_1² + 2   −400x_1
            −400x_1                   200    ].

∇²f(1, 1) = [ 802   −400
              −400   200 ],   condition number: 2508.
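The stated condition number can be verified directly, using the closed-form eigenvalues of a symmetric 2 × 2 matrix (roots of λ² − tr·λ + det = 0):

```python
import math

# Condition number of the Rosenbrock Hessian at the optimum (1, 1).
def hessian(x1, x2):
    return [[-400 * x2 + 1200 * x1 ** 2 + 2, -400 * x1],
            [-400 * x1, 200]]

H = hessian(1.0, 1.0)                      # [[802, -400], [-400, 200]]
tr = H[0][0] + H[1][1]
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
disc = math.sqrt(tr * tr - 4 * det)
lam_max, lam_min = (tr + disc) / 2, (tr - disc) / 2
cond = lam_max / lam_min                   # approximately 2508
```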

Solution of the Rosenbrock Problem with the Gradient Method

x_0 = (2; 5), s = 2, α = 0.25, β = 0.5, ε = 10⁻⁵, backtracking stepsize selection. 6890(!!!) iterations.

Sensitivity of Solutions to Linear Systems

Suppose that we are given the linear system Ax = b, where A ≻ 0, and we assume that x is indeed the solution of the system (x = A⁻¹b). Suppose that the right-hand side is perturbed to b + Δb. What can be said about the solution x + Δx of the new system? Δx = A⁻¹Δb.

Result (derivation in class):

‖Δx‖/‖x‖ ≤ κ(A) ‖Δb‖/‖b‖.

Numerical Example

Consider the ill-conditioned matrix

A = [ 1 + 10⁻⁵   1
      1          1 + 10⁻⁵ ].

>> A=[1+1e-5,1;1,1+1e-5];
>> cond(A)
ans = 2.000009999998795e+005

We have

>> A\[1;1]
ans =
   0.499997500018278
   0.499997500006722

However,

>> A\[1.1;1]
ans =
   1.0e+003 *
   5.000524997400047
  -4.999475002650021
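The same experiment can be redone in Python with Cramer's rule for the 2 × 2 system: a 10% change in one entry of b moves the solution by four orders of magnitude, exactly as the condition number κ(A) ≈ 2·10⁵ predicts.

```python
# Reproduce the MATLAB experiment: solve the 2x2 system by Cramer's rule.
def solve2(A, b):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

eps = 1e-5
A = [[1 + eps, 1.0], [1.0, 1 + eps]]
cond = (2 + eps) / eps            # eigenvalues are 2 + eps and eps
x = solve2(A, [1.0, 1.0])         # about (0.5, 0.5)
xp = solve2(A, [1.1, 1.0])        # about (5000.5, -4999.5)
```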

Scaled Gradient Method

Consider the minimization problem

(P) min{f(x) : x ∈ R^n}.

For a given nonsingular matrix S ∈ R^{n×n}, we make the linear change of variables x = Sy and obtain the equivalent problem

(P') min{g(y) ≡ f(Sy) : y ∈ R^n}.

Since ∇g(y) = S^T ∇f(Sy) = S^T ∇f(x), the gradient method for (P') is

y_{k+1} = y_k − t_k S^T ∇f(Sy_k).

Multiplying the latter equality by S from the left, and using the notation x_k = Sy_k:

x_{k+1} = x_k − t_k SS^T ∇f(x_k).

Defining D = SS^T, we obtain the scaled gradient method:

x_{k+1} = x_k − t_k D∇f(x_k).

Scaled Gradient Method

D ≻ 0, so the direction −D∇f(x_k) is a descent direction:

f'(x_k; −D∇f(x_k)) = −∇f(x_k)^T D∇f(x_k) < 0.

We also allow different scaling matrices at each iteration.

Scaled Gradient Method
Input: ε > 0 - tolerance parameter.
Initialization: pick x_0 ∈ R^n arbitrarily.
General step: for any k = 0, 1, 2, ... execute the following steps:
(a) pick a scaling matrix D_k ≻ 0.
(b) pick a stepsize t_k by a line search procedure on the function g(t) = f(x_k − tD_k∇f(x_k)).
(c) set x_{k+1} = x_k − t_k D_k∇f(x_k).
(d) if ‖∇f(x_{k+1})‖ ≤ ε, then STOP and x_{k+1} is the output.

Choosing the Scaling Matrix D_k

The scaled gradient method with scaling matrix D is equivalent to the gradient method employed on the function g(y) = f(D^{1/2}y). Note that the gradient and Hessian of g are given by

∇g(y) = D^{1/2}∇f(D^{1/2}y) = D^{1/2}∇f(x),
∇²g(y) = D^{1/2}∇²f(D^{1/2}y)D^{1/2} = D^{1/2}∇²f(x)D^{1/2}.

The objective is usually to pick D_k so as to make D_k^{1/2}∇²f(x_k)D_k^{1/2} as well-conditioned as possible.

- A well known choice (Newton's method): D_k = (∇²f(x_k))⁻¹.
- Diagonal scaling: D_k is picked to be diagonal. For example,

  (D_k)_{ii} = (∂²f(x_k)/∂x_i²)⁻¹.

Diagonal scaling can be very effective when the decision variables are of different magnitudes.
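Diagonal scaling can be illustrated on the badly scaled quadratic f(x, y) = 0.01x² + y² from the earlier backtracking example (which needed 201 iterations). With (D)_ii = (∂²f/∂x_i²)⁻¹ = diag(1/0.02, 1/2), the scaled direction becomes D∇f(x) = (x, y), so the stepsize t = 1 reaches the minimizer in a single step:

```python
# One step of the scaled gradient method on f(x, y) = 0.01 x^2 + y^2
# with the inverse-diagonal-Hessian scaling matrix D = diag(50, 0.5).
def scaled_step(x, y, t=1.0):
    gx, gy = 0.02 * x, 2.0 * y          # gradient of f
    Dx, Dy = 1 / 0.02, 1 / 2.0          # diagonal entries of D
    return x - t * Dx * gx, y - t * Dy * gy
```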

The Gauss-Newton Method

Nonlinear least squares problem:

(NLS): min_{x ∈ R^n} g(x) ≡ Σ_{i=1}^m (f_i(x) − c_i)²,

where f_1, ..., f_m are continuously differentiable over R^n and c_1, ..., c_m ∈ R. Denote

F(x) = ( f_1(x) − c_1, f_2(x) − c_2, ..., f_m(x) − c_m )^T.

Then the problem becomes min ‖F(x)‖².

The Gauss-Newton Method Given the kth iterate x k, the next iterate is chosen to minimize the sum of squares of the linearized terms, that is, { m } [ x k+1 = argmin fi (x k ) + f i (x k ) T ] 2 (x x k ) c i. x R n i=1 The general step actually consists of solving the linear LS problem min A k x b k 2, where f 1(x k ) T f 2(x k ) T A k =. = J(x k) f m(x k ) T is the so-called Jacobian matrix, assumed to have full column rank. f 1(x k ) T x k f 1(x k ) + c 1 f 2(x k ) T x k f 2(x k ) + c 2 b k =. = J(x k)x k F (x k ) f m(x k ) T x k f m(x k ) + c m Amir Beck Introduction to Nonlinear Optimization Lecture Slides - The Gradient Method 28 / 33
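The linear LS subproblem can be solved with a least-squares routine rather than forming $(J^T J)^{-1}$ explicitly. A minimal sketch (assuming NumPy; the one-parameter model and its data are hypothetical illustrations, not from the slides):

```python
import numpy as np

def gauss_newton(F, J, x0, iters=20):
    """Sketch of the (pure) Gauss-Newton method.

    F: callable returning the residual vector F(x), shape (m,).
    J: callable returning the Jacobian J(x), shape (m, n), full column rank.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        Jk, Fk = J(x), F(x)
        # Least-squares solve gives step = (J^T J)^{-1} J^T F(x_k),
        # so x_{k+1} = x_k - (J^T J)^{-1} J^T F(x_k).
        step, *_ = np.linalg.lstsq(Jk, Fk, rcond=None)
        x = x - step
    return x

# Hypothetical example: fit the single parameter of f_i(x) = x * t_i to data c_i.
t = np.array([1.0, 2.0, 3.0])
c = np.array([2.0, 4.0, 6.0])
F = lambda x: x[0] * t - c
J = lambda x: t.reshape(-1, 1)
x_star = gauss_newton(F, J, np.array([0.0]))
```

Because the residuals here are linear in $x$, the method converges in a single iteration; for genuinely nonlinear residuals it iterates the linearization.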

The Gauss-Newton Method

The Gauss-Newton method can thus be written as
$$x_{k+1} = (J(x_k)^T J(x_k))^{-1} J(x_k)^T b_k.$$
The gradient of the objective function $f(x) = \|F(x)\|^2$ is
$$\nabla f(x) = 2 J(x)^T F(x).$$
The GN method can therefore be rewritten as
$$x_{k+1} = (J(x_k)^T J(x_k))^{-1} J(x_k)^T (J(x_k) x_k - F(x_k)) = x_k - (J(x_k)^T J(x_k))^{-1} J(x_k)^T F(x_k) = x_k - \tfrac{1}{2} (J(x_k)^T J(x_k))^{-1} \nabla f(x_k),$$
that is, it is a scaled gradient method with a special choice of scaling matrix:
$$D_k = \tfrac{1}{2} (J(x_k)^T J(x_k))^{-1}.$$

The Damped Gauss-Newton Method

The Gauss-Newton method does not incorporate a stepsize, which might cause it to diverge. A well-known variation of the method incorporating stepsizes is the damped Gauss-Newton method.

Damped Gauss-Newton Method
Input: $\varepsilon$ - tolerance parameter.
Initialization: pick $x_0 \in \mathbb{R}^n$ arbitrarily.
General step: for any $k = 0, 1, 2, \ldots$ execute the following steps:
(a) Set $d_k = -(J(x_k)^T J(x_k))^{-1} J(x_k)^T F(x_k)$.
(b) Set $t_k$ by a line search procedure on the function $h(t) = g(x_k + t d_k)$.
(c) Set $x_{k+1} = x_k + t_k d_k$.
(d) If $\|\nabla g(x_{k+1})\| \le \varepsilon$, then STOP, and $x_{k+1}$ is the output.
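The steps above can be sketched as follows (assuming NumPy; the backtracking parameters and the scalar test problem at the end are illustrative choices, not from the slides):

```python
import numpy as np

def damped_gauss_newton(F, J, x0, eps=1e-8, max_iter=100):
    """Sketch of the damped Gauss-Newton method with backtracking line search.

    F returns the residual vector, J its Jacobian (full column rank);
    the objective is g(x) = ||F(x)||^2 with gradient 2 J(x)^T F(x).
    """
    g = lambda x: F(x) @ F(x)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Jk, Fk = J(x), F(x)
        grad = 2.0 * Jk.T @ Fk
        if np.linalg.norm(grad) <= eps:
            break
        # (a) Gauss-Newton direction d_k = -(J^T J)^{-1} J^T F(x_k)
        d, *_ = np.linalg.lstsq(Jk, -Fk, rcond=None)
        # (b) backtracking on h(t) = g(x_k + t d_k) with a sufficient-decrease test
        t, alpha, beta = 1.0, 0.25, 0.5
        while g(x + t * d) > g(x) + alpha * t * (grad @ d):
            t *= beta
        # (c) update
        x = x + t * d
    return x

# Hypothetical scalar test: residual F(x) = x^2 - 4, so the minimizer is x = 2.
F = lambda x: np.array([x[0] ** 2 - 4.0])
J = lambda x: np.array([[2.0 * x[0]]])
root = damped_gauss_newton(F, J, np.array([3.0]))
```

On this scalar problem the full step $t_k = 1$ always passes the sufficient-decrease test, so the iteration reduces to Newton's method for $x^2 = 4$.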

Fermat-Weber Problem

Fermat-Weber problem: given $m$ points $a_1, \ldots, a_m \in \mathbb{R}^n$, also called anchor points, and $m$ weights $\omega_1, \omega_2, \ldots, \omega_m > 0$, find a point $x \in \mathbb{R}^n$ that minimizes the weighted sum of distances from $x$ to the points $a_1, \ldots, a_m$:
$$\min_{x \in \mathbb{R}^n} \left\{ f(x) \equiv \sum_{i=1}^m \omega_i \|x - a_i\| \right\}.$$

- The objective function is not differentiable at the anchor points $a_1, \ldots, a_m$.
- One of the simplest instances of facility location problems.

Weiszfeld's Method (1937)

Start from the stationarity condition $\nabla f(x) = 0$:²
$$\sum_{i=1}^m \omega_i \frac{x - a_i}{\|x - a_i\|} = 0.$$
Equivalently,
$$\left( \sum_{i=1}^m \frac{\omega_i}{\|x - a_i\|} \right) x = \sum_{i=1}^m \frac{\omega_i a_i}{\|x - a_i\|},$$
so that
$$x = \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x - a_i\|}} \sum_{i=1}^m \frac{\omega_i a_i}{\|x - a_i\|}.$$
The stationarity condition can thus be written as $x = T(x)$, where $T$ is the operator
$$T(x) \equiv \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x - a_i\|}} \sum_{i=1}^m \frac{\omega_i a_i}{\|x - a_i\|}.$$
Weiszfeld's method is a fixed point method:
$$x_{k+1} = T(x_k).$$

² We implicitly assume here that $x$ is not an anchor point.

Weiszfeld's Method as a Gradient Method

Weiszfeld's Method
Initialization: pick $x_0 \in \mathbb{R}^n$ such that $x_0 \neq a_1, a_2, \ldots, a_m$.
General step: for any $k = 0, 1, 2, \ldots$ compute:
$$x_{k+1} = T(x_k) = \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}} \sum_{i=1}^m \frac{\omega_i a_i}{\|x_k - a_i\|}.$$

Weiszfeld's method is a gradient method, since
$$x_{k+1} = \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}} \sum_{i=1}^m \frac{\omega_i a_i}{\|x_k - a_i\|} = x_k - \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}} \sum_{i=1}^m \omega_i \frac{x_k - a_i}{\|x_k - a_i\|} = x_k - t_k \nabla f(x_k).$$
That is, it is a gradient method with a special choice of stepsize:
$$t_k = \frac{1}{\sum_{i=1}^m \frac{\omega_i}{\|x_k - a_i\|}}.$$
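A minimal sketch of the fixed-point iteration (assuming NumPy; the square of anchor points is a hypothetical example whose weighted median is the center by symmetry):

```python
import numpy as np

def weiszfeld(anchors, weights, x0, iters=200):
    """Sketch of Weiszfeld's fixed-point iteration x_{k+1} = T(x_k).

    anchors: (m, n) array of points a_i; weights: (m,) positive weights.
    Assumes no iterate lands exactly on an anchor point.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(anchors - x, axis=1)   # ||x_k - a_i||
        w = weights / d                           # omega_i / ||x_k - a_i||
        x = (w[:, None] * anchors).sum(axis=0) / w.sum()
    return x

# Hypothetical example: equal weights at the corners of a square;
# by symmetry the minimizer of f is the center of the square.
a = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
x_star = weiszfeld(a, np.ones(4), x0=np.array([0.5, 0.3]))
```

Each pass computes $T(x_k)$ exactly as in the derivation above: a convex combination of the anchors with weights $\omega_i / \|x_k - a_i\|$.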