10-725: Optimization, Fall 2012
Lecture 10: September 26
Lecturer: Barnabas Poczos/Ryan Tibshirani
Scribes: Yipei Wang, Zhiguang Huo

Note: LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

10.1 Motivation

From previous lectures, we know that the steepest descent method is very slow, while Newton's method is fast but computing the inverse of the Hessian matrix is very expensive. Can we find a method between these two? Conjugate direction methods can be regarded as lying between the method of steepest descent and Newton's method.

10.2 Conjugate Direction Methods

The goal is twofold: (1) accelerate the convergence rate of steepest descent, and (2) avoid the high computational cost of Newton's method.

10.2.1 Definition [Q-conjugate directions]

Let Q be a symmetric matrix. The vectors d_1, d_2, ..., d_k (d_i ∈ R^n, d_i ≠ 0) are Q-orthogonal (conjugate) with respect to Q if

    d_i^T Q d_j = 0   for all i ≠ j.

Note: In the applications we consider, the matrix Q will be positive definite, but this is not inherent in the basic definition. If Q = 0, any two vectors are conjugate; if Q = I, conjugacy is equivalent to the usual notion of orthogonality.
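To make the definition concrete, here is a small numerical illustration (added to these notes, not part of the original lecture; the matrix Q below is an arbitrary choice). The eigenvectors of a symmetric positive definite Q are mutually orthogonal and therefore Q-conjugate, and the definition can be checked directly.

import numpy as np

# A small symmetric positive definite Q (arbitrary choice for illustration).
Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Eigenvectors of a symmetric Q are orthogonal, hence Q-conjugate:
# d_i^T Q d_j = lambda_j d_i^T d_j = 0 for i != j.
_, V = np.linalg.eigh(Q)
D = [V[:, i] for i in range(3)]

# Check the definition d_i^T Q d_j = 0 for all i != j.
for i in range(3):
    for j in range(3):
        if i != j:
            assert abs(D[i] @ Q @ D[j]) < 1e-10

# Q-conjugate directions are also linearly independent (see the lemma below):
# stacking them gives a full-rank matrix.
print(np.linalg.matrix_rank(np.column_stack(D)))  # prints 3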

10.2.2 Linear Independence Lemma

Lemma: Let Q be positive definite. If the vectors d_1, d_2, ..., d_k are Q-conjugate, then they are linearly independent.

Proof (by contradiction): Suppose d_k = α_1 d_1 + ... + α_{k-1} d_{k-1}. Then, since Q is positive definite and d_k ≠ 0,

    0 < d_k^T Q d_k = d_k^T Q (α_1 d_1 + ... + α_{k-1} d_{k-1}) = α_1 d_k^T Q d_1 + ... + α_{k-1} d_k^T Q d_{k-1} = 0,

a contradiction.

10.2.3 The importance of Q-conjugacy

For the quadratic problem, our goal is to solve

    arg min_{x ∈ R^n}  (1/2) x^T Q x - b^T x,

where we assume Q is positive definite. The gradient of the objective function vanishes at the minimizer, so the unique solution to this problem is the solution of

    Qx = b,   x ∈ R^n.

Let x* denote this solution, and let d_0, d_1, ..., d_{n-1} be Q-conjugate vectors. Since d_0, d_1, ..., d_{n-1} are linearly independent,

    x* = α_0 d_0 + ... + α_{n-1} d_{n-1}

for some coefficients α_i. Therefore, by conjugacy,

    d_i^T Q x* = d_i^T Q (α_0 d_0 + ... + α_{n-1} d_{n-1}) = α_i d_i^T Q d_i.

We do not need to know x* to obtain α_i, because Q x* = b:

    α_i = d_i^T Q x* / (d_i^T Q d_i) = d_i^T b / (d_i^T Q d_i),

and hence

    x* = Σ_{i=0}^{n-1} α_i d_i = Σ_{i=0}^{n-1} [ d_i^T b / (d_i^T Q d_i) ] d_i.

We can see that there is no need to do a matrix inversion; we only need to compute inner products.
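As a quick sanity check (an added illustration; the particular Q, b, and choice of directions are arbitrary), the expansion above recovers the solution of Qx = b from inner products alone, with no matrix inversion.

import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite (illustrative)
b = np.array([1.0, 2.0])

# Eigenvectors of Q are one convenient set of Q-conjugate directions.
_, V = np.linalg.eigh(Q)
d = [V[:, 0], V[:, 1]]

# x* = sum_i (d_i^T b / d_i^T Q d_i) d_i  -- only inner products are needed.
x_star = sum((di @ b) / (di @ Q @ di) * di for di in d)

print(x_star)                 # expansion in conjugate directions
print(np.linalg.solve(Q, b))  # agrees with the direct solve of Qx = b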

10.2.4 Conjugate Direction Theorem

The expansion of x* can be viewed as the result of an iterative process of n steps, where at the i-th step the term α_i d_i is added. This can be generalized further so that the starting point of the iteration is an arbitrary x_0. We introduce the conjugate direction theorem here.

Theorem [Conjugate Direction Theorem]: Let d_0, d_1, ..., d_{n-1} be Q-conjugate vectors and let x_0 ∈ R^n be an arbitrary starting point. Define the iteration

    x_{k+1} = x_k + α_k d_k,    g_k = Q x_k - b,    α_k = -g_k^T d_k / (d_k^T Q d_k) = -(Q x_k - b)^T d_k / (d_k^T Q d_k).

Then after n steps, x_n = x*.

Proof: Since d_0, d_1, ..., d_{n-1} are linearly independent, we can write

    x* - x_0 = α_0 d_0 + ... + α_{n-1} d_{n-1}   for some α_0, ..., α_{n-1}.

Using the update rule x_{k+1} = x_k + α_k d_k with these coefficients, we have

    x_1 = x_0 + α_0 d_0
    x_2 = x_0 + α_0 d_0 + α_1 d_1
    ...
    x_k = x_0 + α_0 d_0 + α_1 d_1 + ... + α_{k-1} d_{k-1}
    ...
    x_n = x_0 + α_0 d_0 + α_1 d_1 + ... + α_{n-1} d_{n-1} = x*.

Therefore, it is enough to prove that these coefficients satisfy α_k = -g_k^T d_k / (d_k^T Q d_k).

From the expansion of x* - x_0 and conjugacy,

    d_k^T Q (x* - x_0) = d_k^T Q (α_0 d_0 + ... + α_{n-1} d_{n-1}) = α_k d_k^T Q d_k,

so that

    α_k = d_k^T Q (x* - x_0) / (d_k^T Q d_k).

On the other hand, x_k - x_0 = α_0 d_0 + ... + α_{k-1} d_{k-1} lies in span(d_0, ..., d_{k-1}), so

    d_k^T Q (x_k - x_0) = d_k^T Q (α_0 d_0 + α_1 d_1 + ... + α_{k-1} d_{k-1}) = 0.

Hence

    d_k^T Q (x* - x_0) = d_k^T Q (x* - x_k + x_k - x_0) = d_k^T Q (x* - x_k) = d_k^T (b - Q x_k) = -d_k^T g_k,

and therefore

    α_k = d_k^T Q (x* - x_0) / (d_k^T Q d_k) = d_k^T Q (x* - x_k) / (d_k^T Q d_k) = -d_k^T g_k / (d_k^T Q d_k).
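A minimal sketch of the iteration in the theorem (an illustration added to these notes; the matrix, right-hand side, starting point, and choice of conjugate directions are arbitrary): running the conjugate direction update for n steps from any x_0 lands exactly on the solution of Qx = b.

import numpy as np

def conjugate_direction(Q, b, directions, x0):
    """Conjugate direction method: x_{k+1} = x_k + alpha_k d_k with
    alpha_k = -g_k^T d_k / (d_k^T Q d_k) and g_k = Q x_k - b."""
    x = x0.astype(float)
    for d in directions:
        g = Q @ x - b
        alpha = -(g @ d) / (d @ Q @ d)
        x = x + alpha * d
    return x

Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, -2.0, 3.0])

# Eigenvectors of Q serve as Q-conjugate directions for this illustration.
_, V = np.linalg.eigh(Q)
x0 = np.array([10.0, -5.0, 7.0])          # arbitrary starting point

x_n = conjugate_direction(Q, b, [V[:, i] for i in range(3)], x0)
print(np.allclose(x_n, np.linalg.solve(Q, b)))  # True: x_n = x* after n steps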

Another motivation for Q-conjugacy

Our goal is

    arg min_{x ∈ R^n}  (1/2) x^T Q x - b^T x.

Writing x - x_0 = α_0 d_0 + ... + α_{n-1} d_{n-1} for some {α_i}_{i=0}^{n-1} ⊂ R, we have

    f(x) = (1/2) [x_0 + Σ_j α_j d_j]^T Q [x_0 + Σ_j α_j d_j] - b^T [x_0 + Σ_j α_j d_j]
         = c + Σ_{j=0}^{n-1} ( (1/2) α_j^2 d_j^T Q d_j + α_j d_j^T (Q x_0 - b) ),

where c is a constant independent of the α_j, since the cross terms α_i α_j d_i^T Q d_j with i ≠ j vanish by conjugacy. The optimization problem is thus transformed into n separate one-dimensional optimization problems, one for each α_j.

10.2.5 Expanding Subspace Theorem

Let B_k = span(d_0, ..., d_{k-1}) ⊂ R^n. We will show that as the method of conjugate directions progresses, each x_k minimizes the objective f(x) = (1/2) x^T Q x - b^T x both over the affine set x_0 + B_k and over the line x_{k-1} + α d_{k-1}, α ∈ R. That is,

    x_k = arg min_{x = x_{k-1} + α d_{k-1}, α ∈ R} (1/2) x^T Q x - b^T x    and    x_k = arg min_{x ∈ x_0 + B_k} (1/2) x^T Q x - b^T x.

Theorem [Expanding Subspace Theorem]: Let {d_i}_{i=0}^{n-1} be a sequence of Q-conjugate vectors in R^n, let x_0 ∈ R^n be arbitrary, and let

    x_{k+1} = x_k + α_k d_k,    α_k = -g_k^T d_k / (d_k^T Q d_k).

Then

    x_k = arg min_{x = x_{k-1} + α d_{k-1}, α ∈ R} (1/2) x^T Q x - b^T x    and    x_k = arg min_{x ∈ x_0 + B_k} (1/2) x^T Q x - b^T x.

Proof: It is enough to show that x_k minimizes f over x_0 + B_k, since this set contains the line x = x_{k-1} + α d_{k-1} (by the definition of B_k). Since f is strictly convex, it is enough to show that g_k = ∇f(x_k) is orthogonal to B_k.

We prove g_k ⊥ B_k by induction. For k = 0, B_0 = span(∅) = {0}, so the claim holds trivially. Assume g_k ⊥ B_k; we prove that g_{k+1} ⊥ B_{k+1}. By definition, x_{k+1} = x_k + α_k d_k and g_k = Q x_k - b. Therefore,

    g_{k+1} = Q x_{k+1} - b = Q (x_k + α_k d_k) - b = (Q x_k - b) + α_k Q d_k = g_k + α_k Q d_k.

First, let us prove that g_{k+1} ⊥ d_k. We have just shown g_{k+1} = g_k + α_k Q d_k, and by definition α_k = -g_k^T d_k / (d_k^T Q d_k). Therefore,

    d_k^T g_{k+1} = d_k^T (g_k + α_k Q d_k) = d_k^T g_k - [g_k^T d_k / (d_k^T Q d_k)] d_k^T Q d_k = d_k^T g_k - d_k^T g_k = 0,

so g_{k+1} ⊥ d_k.

Now let us prove that g_{k+1} ⊥ d_i for i < k. Since g_{k+1} = g_k + α_k Q d_k and g_k ⊥ B_k = span(d_0, ..., d_{k-1}) by the induction hypothesis,

    d_i^T g_{k+1} = d_i^T (g_k + α_k Q d_k) = d_i^T g_k + α_k d_i^T Q d_k = 0 + 0 = 0,

where the second term vanishes by Q-conjugacy. Hence g_{k+1} ⊥ d_i for all i < k.

So far we have proved that g_{k+1} ⊥ B_{k+1}, where B_{k+1} = span(d_0, ..., d_k), which completes the induction.

Corollary of the Expanding Subspace Theorem

Corollary: g_k ⊥ B_k = span(d_0, ..., d_{k-1}), i.e., g_k^T d_i = 0 for 0 ≤ i < k. Moreover, since B_0 ⊂ B_1 ⊂ ... ⊂ B_k ⊂ ... ⊂ B_n = R^n and x_k minimizes f over x_0 + B_k, the final iterate x_n is the minimizer of f over all of R^n.
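The orthogonality relations above are easy to observe numerically. The following small check (an added illustration with randomly generated problem data) verifies that, along the conjugate direction iteration, each gradient g_k is orthogonal to all previous directions d_0, ..., d_{k-1}, and that x_n solves Qx = b.

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)        # random symmetric positive definite Q
b = rng.standard_normal(n)

_, V = np.linalg.eigh(Q)           # eigenvectors as Q-conjugate directions
x = rng.standard_normal(n)         # arbitrary x_0

for k in range(n):
    g = Q @ x - b
    # Expanding Subspace Theorem: g_k is orthogonal to span(d_0, ..., d_{k-1}).
    for i in range(k):
        assert abs(g @ V[:, i]) < 1e-8
    d = V[:, k]
    alpha = -(g @ d) / (d @ Q @ d)
    x = x + alpha * d

print(np.allclose(x, np.linalg.solve(Q, b)))  # True: x_n minimizes f over R^n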

10.3 Conjugate gradient method

The conjugate gradient method:
- is a conjugate direction method;
- selects the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses;
- does not specify the conjugate directions beforehand; rather, they are determined sequentially at each step of the iteration.

10.3.1 Description and advantages

So far, given d_0, ..., d_{n-1}, we already have an update rule for α_k:

    α_k = -g_k^T d_k / (d_k^T Q d_k).

The question is how to choose the vectors d_0, ..., d_{n-1}. The conjugate gradient method is the conjugate direction method obtained by selecting the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses. Thus, the directions are not specified beforehand, but rather are determined sequentially at each step of the iteration. There are three primary advantages to this method of direction selection.

1. Unless the solution is attained in fewer than n steps, the gradient is always nonzero and linearly independent of all previous direction vectors.

2. A more important advantage of the conjugate gradient method is the especially simple formula used to determine the new direction vector. This simplicity makes the method only slightly more complicated than steepest descent.

3. Because the directions are based on the gradients, the process makes good uniform progress toward the solution at every step. This is in contrast to the situation for arbitrary sequences of conjugate directions, in which progress may be slight until the final few steps.

10.3.2 Algorithm

Let x_0 ∈ R^n be arbitrary and set

    d_0 = -g_0 = b - Q x_0.

For k = 0, 1, ..., n-1:

    α_k = -g_k^T d_k / (d_k^T Q d_k)
    x_{k+1} = x_k + α_k d_k
    g_{k+1} = Q x_{k+1} - b
    β_k = g_{k+1}^T Q d_k / (d_k^T Q d_k)
    d_{k+1} = -g_{k+1} + β_k d_k
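Below is a minimal sketch of the algorithm above in Python/NumPy (added for illustration; the stopping tolerance and the small test problem are arbitrary choices, while the update formulas are exactly those of Section 10.3.2).

import numpy as np

def conjugate_gradient(Q, b, x0, tol=1e-10):
    """Conjugate gradient method for minimizing (1/2) x^T Q x - b^T x,
    i.e. for solving Qx = b with Q symmetric positive definite."""
    x = x0.astype(float)
    g = Q @ x - b                      # g_0
    d = -g                             # d_0 = -g_0 = b - Q x_0
    for _ in range(len(b)):            # at most n steps for a quadratic
        if np.linalg.norm(g) < tol:    # solution reached early
            break
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)    # alpha_k = -g_k^T d_k / d_k^T Q d_k
        x = x + alpha * d              # x_{k+1} = x_k + alpha_k d_k
        g = Q @ x - b                  # g_{k+1} = Q x_{k+1} - b
        beta = (g @ Qd) / (d @ Qd)     # beta_k = g_{k+1}^T Q d_k / d_k^T Q d_k
        d = -g + beta * d              # d_{k+1} = -g_{k+1} + beta_k d_k
    return x

# Small test problem (illustrative): CG reaches the solution of Qx = b in n steps.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b, np.zeros(2)))   # approx [0.0909, 0.6364]
print(np.linalg.solve(Q, b))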

In the algorithm, the first step is identical to a steepest descent step; each succeeding step moves in a direction that is a linear combination of the current gradient and the preceding direction vector. The attractive feature of the algorithm is the simple formula for updating the direction vector. The method is only slightly more complicated to implement than the method of steepest descent, but converges in a finite number of steps.

Theorem 10.1: The conjugate gradient algorithm is a conjugate direction method. If it does not terminate at x_k, then

a) span(g_0, g_1, ..., g_k) = span(g_0, Qg_0, ..., Q^k g_0)
b) span(d_0, d_1, ..., d_k) = span(g_0, Qg_0, ..., Q^k g_0)
c) d_k^T Q d_i = 0 for i < k
d) α_k = g_k^T g_k / (d_k^T Q d_k)
e) β_k = g_{k+1}^T g_{k+1} / (g_k^T g_k)

Proof: We first prove a), b) and c) simultaneously by induction. Clearly, they are true for k = 0. Now suppose they are true for k; we show that they are true for k + 1. We have

    g_{k+1} = g_k + α_k Q d_k.

By the induction hypothesis, both g_k and Q d_k belong to span(g_0, Qg_0, ..., Q^{k+1} g_0), the first by a) and the second by b). Thus g_{k+1} ∈ span(g_0, Qg_0, ..., Q^{k+1} g_0). Furthermore, g_{k+1} ∉ span(d_0, d_1, ..., d_k) = span(g_0, Qg_0, ..., Q^k g_0), since otherwise g_{k+1} = 0, because for any conjugate direction method g_{k+1} is orthogonal to span(d_0, d_1, ..., d_k). (The induction hypothesis on c) guarantees that the method is a conjugate direction method up to x_{k+1}.) Thus, finally, we conclude that

    span(g_0, g_1, ..., g_{k+1}) = span(g_0, Qg_0, ..., Q^{k+1} g_0),

which proves a). To prove b) we write

    d_{k+1} = -g_{k+1} + β_k d_k,

and b) immediately follows from a) and the induction hypothesis on b). Next, to prove c) we have

    d_{k+1}^T Q d_i = -g_{k+1}^T Q d_i + β_k d_k^T Q d_i.

For i = k the right side is zero by the definition of β_k. For i < k both terms vanish. The first term vanishes since Q d_i ∈ span(d_0, d_1, ..., d_{i+1}) by b), while the induction hypothesis on c) guarantees that the method is a conjugate direction method up to x_{k+1}, so the Expanding Subspace Theorem guarantees that g_{k+1} is orthogonal to span(d_0, d_1, ..., d_k) ⊇ span(d_0, d_1, ..., d_{i+1}). The second term vanishes by the induction hypothesis on c). This proves c), which also shows that the method is a conjugate direction method.

To prove d), we have from d_k = -g_k + β_{k-1} d_{k-1} that

    -g_k^T d_k = g_k^T g_k - β_{k-1} g_k^T d_{k-1},

and the second term is zero by the Expanding Subspace Theorem. Hence α_k = -g_k^T d_k / (d_k^T Q d_k) = g_k^T g_k / (d_k^T Q d_k).

Finally, to prove e), note that g_{k+1}^T g_k = 0, because g_k ∈ span(d_0, d_1, ..., d_k) and g_{k+1} ⊥ span(d_0, d_1, ..., d_k). Since α_k Q d_k = g_{k+1} - g_k, we have

    g_{k+1}^T Q d_k = (1/α_k) g_{k+1}^T (g_{k+1} - g_k) = (1/α_k) g_{k+1}^T g_{k+1},

and therefore, using d),

    β_k = g_{k+1}^T Q d_k / (d_k^T Q d_k) = g_{k+1}^T g_{k+1} / (α_k d_k^T Q d_k) = g_{k+1}^T g_{k+1} / (g_k^T g_k).
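These identities can be checked numerically (an added illustration; the problem data are randomly generated): along the conjugate gradient trajectory the two expressions for α_k and β_k coincide, and the gradients g_0, ..., g_{n-1} are mutually orthogonal.

import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)          # random symmetric positive definite Q
b = rng.standard_normal(n)

x = np.zeros(n)
g = Q @ x - b
d = -g
grads = [g]
for _ in range(n):
    Qd = Q @ d
    # Property d): -g_k^T d_k / d_k^T Q d_k equals g_k^T g_k / d_k^T Q d_k.
    assert np.isclose(-(g @ d) / (d @ Qd), (g @ g) / (d @ Qd))
    alpha = (g @ g) / (d @ Qd)
    x = x + alpha * d
    g_new = Q @ x - b
    # Property e): g_{k+1}^T Q d_k / d_k^T Q d_k equals g_{k+1}^T g_{k+1} / g_k^T g_k.
    assert np.isclose((g_new @ Qd) / (d @ Qd), (g_new @ g_new) / (g @ g))
    d = -g_new + ((g_new @ g_new) / (g @ g)) * d
    g = g_new
    grads.append(g)

# The gradients g_0, ..., g_{n-1} are mutually orthogonal.
G = np.array(grads[:-1])
off_diag = G @ G.T - np.diag(np.diag(G @ G.T))
print(np.max(np.abs(off_diag)))      # close to zero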

10.4 Extension to non-quadratic problems

10.4.1 Description

Our goal is now

    min_{x ∈ R^n} f(x).

We make a quadratic approximation by using the associations g_k = ∇f(x_k) and Q = ∇²f(x_k). Using these associations, re-evaluated at each step, all quantities necessary to implement the basic conjugate gradient algorithm can be evaluated. If f is quadratic, these associations are identities, so the general algorithm obtained by using them is a generalization of the conjugate gradient scheme. This is similar to the philosophy underlying Newton's method, where at each step the solution of a general problem is approximated by the solution of a purely quadratic problem through these same associations.

When applied to nonquadratic problems, conjugate gradient methods will not usually terminate within n steps. It is possible, therefore, simply to continue finding new directions according to the algorithm and to terminate only when some stopping criterion is met. Alternatively, after n steps we can restart the process from the current point and run the algorithm for another n steps.

10.4.2 Algorithm

1. Start at x_0, compute g_0 = ∇f(x_0), and set d_0 = -g_0.
2. For k = 0, 1, ..., n-1:
   a) Set x_{k+1} = x_k + α_k d_k, where α_k = -g_k^T d_k / (d_k^T [∇²f(x_k)] d_k).
   b) Compute g_{k+1} = ∇f(x_{k+1}).
   c) Unless k = n-1, set d_{k+1} = -g_{k+1} + β_k d_k, where β_k = g_{k+1}^T [∇²f(x_k)] d_k / (d_k^T [∇²f(x_k)] d_k).
3. Replace x_0 by x_n and go back to step 1. Repeat until convergence.

(A code sketch of this restart scheme appears at the end of this section.)

10.4.3 Properties of CGA

An attractive feature of the algorithm is that, just as in the pure form of Newton's method, no line search is required at any stage. Also, the algorithm converges in a finite number of steps for a quadratic problem. The undesirable features are that the Hessian ∇²f(x_k) must be evaluated at each point, which is often impractical, and that the algorithm is not, in this form, globally convergent.
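The following is a minimal sketch of the Hessian-based scheme of Section 10.4.2 with restarts (added for illustration, not from the original notes). The test function, its gradient, and its Hessian are illustrative choices; grad_f and hess_f are assumed to be supplied by the user.

import numpy as np

def nonlinear_cg_hessian(grad_f, hess_f, x0, n_cycles=50, tol=1e-8):
    """Conjugate gradient extension of Section 10.4.2: each cycle runs n steps
    with Q replaced by the Hessian at the current iterate, then restarts."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(n_cycles):
        g = grad_f(x)
        d = -g
        for k in range(n):
            H = hess_f(x)
            Hd = H @ d
            alpha = -(g @ d) / (d @ Hd)          # alpha_k from step 2 a)
            x = x + alpha * d                    # x_{k+1} = x_k + alpha_k d_k
            g = grad_f(x)                        # g_{k+1}
            if np.linalg.norm(g) < tol:
                return x
            if k < n - 1:
                beta = (g @ Hd) / (d @ Hd)       # beta_k from step 2 c)
                d = -g + beta * d
    return x

# Illustrative nonquadratic test: f(x) = sum(x^4) + (1/2) x^T x, minimized at x = 0.
f_grad = lambda x: 4 * x**3 + x
f_hess = lambda x: np.diag(12 * x**2 + 1)
print(nonlinear_cg_hessian(f_grad, f_hess, np.array([2.0, -1.5, 0.5])))  # near 0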

10.5 Line search methods

Two line search methods of this type are the Fletcher-Reeves method and the Polak-Ribiere method. These algorithms are quite similar to the conjugate gradient method above; the only differences are in steps 2 a) and 2 c).

10.5.1 Algorithm for the Fletcher-Reeves method

1. Start at x_0, compute g_0 = ∇f(x_0), and set d_0 = -g_0.
2. For k = 0, 1, ..., n-1:
   a) Set x_{k+1} = x_k + α_k d_k, where α_k = arg min_α f(x_k + α d_k).
   b) Compute g_{k+1} = ∇f(x_{k+1}).
   c) Unless k = n-1, set d_{k+1} = -g_{k+1} + β_k d_k, where β_k = g_{k+1}^T g_{k+1} / (g_k^T g_k).
3. Replace x_0 by x_n and go back to step 1. Repeat until convergence.

First, the Fletcher-Reeves method is a line search method: α_k is obtained by a one-dimensional minimization, which normally has no closed-form solution. Second, the Hessian is not used in the algorithm. Moreover, in the quadratic case it is identical to the original conjugate direction algorithm.

10.5.2 Algorithm for the Polak-Ribiere method

The Polak-Ribiere method is quite similar to the Fletcher-Reeves method. The only difference occurs when evaluating β_k.

1. Start at x_0, compute g_0 = ∇f(x_0), and set d_0 = -g_0.
2. For k = 0, 1, ..., n-1:
   a) Set x_{k+1} = x_k + α_k d_k, where α_k = arg min_α f(x_k + α d_k).
   b) Compute g_{k+1} = ∇f(x_{k+1}).
   c) Unless k = n-1, set d_{k+1} = -g_{k+1} + β_k d_k, where β_k = (g_{k+1} - g_k)^T g_{k+1} / (g_k^T g_k).
3. Replace x_0 by x_n and go back to step 1. Repeat until convergence.

Again, this β_k is identical to the standard formula in the quadratic case. Experimental evidence seems to favor the Polak-Ribiere method over other methods of this general type. A sketch of both variants follows.
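Below is a minimal sketch of the two line search variants (added for illustration; the one-dimensional minimization in step 2 a) is done numerically with scipy.optimize.minimize_scalar, and the test function is an arbitrary choice). Only the β_k formula differs between the two.

import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad_f, x0, variant="PR", n_cycles=20, tol=1e-8):
    """Fletcher-Reeves ('FR') / Polak-Ribiere ('PR') conjugate gradient with
    a numerical line search for alpha_k, restarting every n steps."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(n_cycles):
        g = grad_f(x)
        d = -g
        for k in range(n):
            # Step 2 a): alpha_k = arg min_alpha f(x_k + alpha d_k)
            alpha = minimize_scalar(lambda a: f(x + a * d)).x
            x = x + alpha * d
            g_new = grad_f(x)                      # step 2 b)
            if np.linalg.norm(g_new) < tol:
                return x
            if k < n - 1:                          # step 2 c)
                if variant == "FR":
                    beta = (g_new @ g_new) / (g @ g)
                else:                              # Polak-Ribiere
                    beta = ((g_new - g) @ g_new) / (g @ g)
                d = -g_new + beta * d
            g = g_new
    return x

# Illustrative test: a smooth nonquadratic function with minimizer at the origin.
f = lambda x: np.sum(x**4) + 0.5 * np.sum(x**2)
grad_f = lambda x: 4 * x**3 + x
print(nonlinear_cg(f, grad_f, np.array([2.0, -1.5, 0.5]), variant="PR"))
print(nonlinear_cg(f, grad_f, np.array([2.0, -1.5, 0.5]), variant="FR"))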

10.6 Convergence

Global convergence of the line search methods is established by noting that a pure steepest descent step is taken every n steps and serves as a spacer step. Since the other steps do not increase the objective (and in fact hopefully decrease it), global convergence is assured. Thus the restarting aspect of the algorithm is important for the global convergence analysis, since in general one cannot guarantee that the directions d_k generated by the method are descent directions.

The local convergence properties of both of the above, and most other, nonquadratic extensions of the conjugate gradient method can be inferred from the quadratic analysis. Since one complete cycle solves a quadratic problem exactly, just as Newton's method does in one step, we expect that for general nonquadratic problems

    ||x_{k+n} - x*|| ≤ c ||x_k - x*||^2

for some constant c and k = 0, n, 2n, 3n, .... This can indeed be proved, and it of course underlies the original motivation for the method. For problems with large n, however, a result of this type is in itself of little comfort, since we probably hope to terminate in fewer than n steps.

10.7 Summary

Conjugate Direction Methods
- Q-conjugate directions
- Minimizing quadratic functions

Conjugate Gradient Methods for nonquadratic functions
- Line search methods:
  1. Fletcher-Reeves method
  2. Polak-Ribiere method

References

[LY08] Luenberger, David G. and Ye, Yinyu, Linear and Nonlinear Programming, Springer, 2008, vol. 116.