The Conjugate Gradient Algorithm

Outline: Optimization over a Subspace; Conjugate Direction Methods; the Conjugate Gradient Algorithm; the Non-Quadratic Conjugate Gradient Algorithm.

Optimization over a Subspace. Consider the problem

  $\min f(x)$  subject to  $x \in x_0 + S$,

where $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $S$ is the subspace $S := \mathrm{Span}\{v_1, \dots, v_k\}$. If $V \in \mathbb{R}^{n \times k}$ has columns $v_1, \dots, v_k$, then this problem is equivalent to

  $\min f(x_0 + Vz)$  subject to  $z \in \mathbb{R}^k$.

Subspace Optimality Condition. Consider the reduced problem

  $\min f(x_0 + Vz)$  subject to  $z \in \mathbb{R}^k$,

and set $\hat f(z) = f(x_0 + Vz)$. If $\bar z$ solves this problem, then

  $V^T \nabla f(x_0 + V\bar z) = \nabla \hat f(\bar z) = 0$.

Setting $\bar x = x_0 + V\bar z$, we note that $\bar x$ solves the original problem if and only if $\bar z$ solves the problem above. Then $V^T \nabla f(\bar x) = 0$, or equivalently, $v_i^T \nabla f(\bar x) = 0$ for $i = 1, 2, \dots, k$, so $\nabla f(\bar x) \perp \mathrm{Span}\{v_1, \dots, v_k\}$.

Subspace Optimality Theorem. Consider the problem

  $P_S$ :  $\min f(x)$  subject to  $x \in x_0 + S$,

where $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $S$ is the subspace $S := \mathrm{Span}\{v_1, \dots, v_k\}$. If $\bar x$ solves $P_S$, then $\nabla f(\bar x) \perp S$. If it is further assumed that $f$ is convex, then $\bar x$ solves $P_S$ if and only if $\nabla f(\bar x) \perp S$.
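As a quick numerical illustration of the theorem, the sketch below minimizes a convex quadratic over an affine set $x_0 + \mathrm{Span}\{v_1, \dots, v_k\}$ by solving the reduced problem in $z$ and then checks that $V^T \nabla f(\bar x) = 0$. The particular quadratic, subspace, and use of NumPy are choices made here for illustration, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3

# A convex quadratic f(x) = 0.5 x^T Q x - b^T x with Q symmetric positive definite.
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)
grad = lambda x: Q @ x - b

x0 = rng.standard_normal(n)
V = rng.standard_normal((n, k))          # columns span the subspace S

# Reduced problem: minimize fhat(z) = f(x0 + V z).  For this quadratic,
# grad fhat(z) = V^T (Q(x0 + V z) - b) = 0 is a k x k linear system in z.
z_bar = np.linalg.solve(V.T @ Q @ V, V.T @ (b - Q @ x0))
x_bar = x0 + V @ z_bar

# Subspace optimality: the gradient at x_bar is orthogonal to S.
print(np.linalg.norm(V.T @ grad(x_bar)))   # ~ 1e-14
```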

Conjugate Direction Algorithm. Consider the problem

  $P$ :  minimize $f(x)$  subject to  $x \in \mathbb{R}^n$,

where $f : \mathbb{R}^n \to \mathbb{R}$ is $C^2$ and given by

  $f(x) := \tfrac{1}{2} x^T Q x - b^T x$

with $Q$ symmetric and positive definite.

Definition [Conjugacy]. Let $Q \in \mathbb{R}^{n \times n}$ be symmetric and positive definite. We say that the vectors $x, y \in \mathbb{R}^n \setminus \{0\}$ are Q-conjugate (or Q-orthogonal) if $x^T Q y = 0$.

Proposition [Conjugacy implies Linear Independence]. If $Q \in \mathbb{R}^{n \times n}$ is positive definite and the nonzero vectors $d_0, d_1, \dots, d_k$ are (pairwise) Q-conjugate, then these vectors are linearly independent.
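The proposition is easy to check numerically. The sketch below builds Q-conjugate directions from an arbitrary basis by a Gram-Schmidt-style procedure in the Q-inner product (this construction is an assumption made here for illustration; the slides do not prescribe how the directions are obtained) and verifies pairwise conjugacy and linear independence.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)          # symmetric positive definite

# Q-conjugate directions via Gram-Schmidt in the inner product <u, v>_Q = u^T Q v.
basis = rng.standard_normal((n, n))  # columns: an arbitrary basis of R^n
D = []
for v in basis.T:
    d = v.copy()
    for dj in D:
        d -= (dj @ Q @ v) / (dj @ Q @ dj) * dj
    D.append(d)
D = np.array(D)                      # rows are the Q-conjugate directions

conj = D @ Q @ D.T                   # entry (i, j) is d_i^T Q d_j
off = conj - np.diag(np.diag(conj))
print(np.max(np.abs(off)))           # ~ 1e-12: pairwise Q-conjugate
print(np.linalg.matrix_rank(D))      # n: the directions are linearly independent
```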

Conjugate Direction Algorithm. Let $\{d_i\}_{i=0}^{n-1}$ be a set of nonzero Q-conjugate vectors. For any $x_0 \in \mathbb{R}^n$ the sequence $\{x_k\}$ generated according to

  $x_{k+1} := x_k + \alpha_k d_k$,  $k \ge 0$,
  $\alpha_k := \arg\min\{ f(x_k + \alpha d_k) : \alpha \in \mathbb{R} \}$,

converges to the unique solution $\bar x$ of $P$ after $n$ steps, that is, $x_n = \bar x$. We have already shown that $\alpha_k = -\nabla f(x_k)^T d_k / d_k^T Q d_k$.
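A minimal sketch of the finite-termination claim, assuming the Q-conjugate directions are taken to be the eigenvectors of Q (one valid choice among many; any Q-conjugate set works): after n exact line-search steps the iterate solves Qx = b.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)

# Eigenvectors of symmetric Q are Q-conjugate: v_i^T Q v_j = lambda_j v_i^T v_j = 0 for i != j.
_, V = np.linalg.eigh(Q)
directions = V.T                      # rows d_0, ..., d_{n-1}

x = rng.standard_normal(n)            # x_0
for d in directions:
    g = Q @ x - b                     # gradient of f(x) = 0.5 x^T Q x - b^T x
    alpha = -(g @ d) / (d @ Q @ d)    # exact line search along d
    x = x + alpha * d

print(np.linalg.norm(x - np.linalg.solve(Q, b)))   # ~ 1e-13: x_n solves Qx = b
```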

Expanding Subspace Theorem. Let $\{d_i\}_{i=0}^{n-1}$ be a sequence of nonzero Q-conjugate vectors in $\mathbb{R}^n$. Then for any $x_0 \in \mathbb{R}^n$ the sequence $\{x_k\}$ generated according to

  $x_{k+1} = x_k + \alpha_k d_k$,  $\alpha_k = -\dfrac{g_k^T d_k}{d_k^T Q d_k}$,

has the property that $f(x) = \tfrac{1}{2} x^T Q x - b^T x$ attains its minimum value on the affine set $x_0 + \mathrm{Span}\{d_0, \dots, d_k\}$ at the point $x_{k+1}$.

Expanding Subspace Theorem: Proof. Let $\bar x$ solve $\min\{ f(x) : x \in x_0 + \mathrm{Span}\{d_0, \dots, d_k\} \}$. The Subspace Optimality Theorem tells us that

  $\nabla f(\bar x)^T d_i = 0$,  $i = 0, \dots, k$.

Since $\bar x \in x_0 + \mathrm{Span}\{d_0, \dots, d_k\}$, there exist $\beta_i \in \mathbb{R}$ such that

  $\bar x = x_0 + \beta_0 d_0 + \beta_1 d_1 + \dots + \beta_k d_k$.

Therefore, since $\nabla f(x) = Qx - b$ and the $d_i$ are Q-conjugate,

  $0 = \nabla f(\bar x)^T d_i$
    $= (Q(x_0 + \beta_0 d_0 + \beta_1 d_1 + \dots + \beta_k d_k) - b)^T d_i$
    $= (Q x_0 - b)^T d_i + \beta_0 d_0^T Q d_i + \beta_1 d_1^T Q d_i + \dots + \beta_k d_k^T Q d_i$
    $= \nabla f(x_0)^T d_i + \beta_i d_i^T Q d_i$.

Expanding Subspace Theorem: Proof (continued). Hence

  $0 = \nabla f(x_0)^T d_i + \beta_i d_i^T Q d_i$,  $i = 0, 1, \dots, k$,

so

  $\beta_i = -\nabla f(x_0)^T d_i / d_i^T Q d_i$,  $i = 0, \dots, k$.

Similarly, the iteration

  $x_{i+1} = x_i + \alpha_i d_i$,  $\alpha_i = -\dfrac{\nabla f(x_i)^T d_i}{d_i^T Q d_i}$,

gives $x_{k+1} = x_0 + \alpha_0 d_0 + \alpha_1 d_1 + \dots + \alpha_k d_k$. So we need to show that

  $\beta_i = -\dfrac{\nabla f(x_0)^T d_i}{d_i^T Q d_i} = -\dfrac{\nabla f(x_i)^T d_i}{d_i^T Q d_i} = \alpha_i$,  $i = 0, \dots, k$.

Expanding Subspace Theorem: Proof (conclusion). By Q-conjugacy,

  $\nabla f(x_i)^T d_i = (Q(x_0 + \alpha_0 d_0 + \dots + \alpha_{i-1} d_{i-1}) - b)^T d_i$
    $= (Q x_0 - b)^T d_i + \alpha_0 d_0^T Q d_i + \dots + \alpha_{i-1} d_{i-1}^T Q d_i$
    $= \nabla f(x_0)^T d_i$,

so $\beta_i = \alpha_i$ for each $i$, which proves the theorem.

The Conjugate Gradient Algorithm.

Initialization: $x_0 \in \mathbb{R}^n$, $d_0 = -g_0$, where $g_0 = \nabla f(x_0) = Qx_0 - b$.

For $k = 0, 1, 2, \dots$:
  $\alpha_k := -g_k^T d_k / d_k^T Q d_k$
  $x_{k+1} := x_k + \alpha_k d_k$
  $g_{k+1} := Q x_{k+1} - b$   (STOP if $g_{k+1} = 0$)
  $\beta_k := g_{k+1}^T Q d_k / d_k^T Q d_k$
  $d_{k+1} := -g_{k+1} + \beta_k d_k$
  $k := k + 1$.
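The following is a direct Python/NumPy transcription of the iteration above (a sketch for illustration; the function name, tolerances, and test problem are choices made here). For an n-by-n positive definite Q it drives the gradient to zero, up to round-off, in at most n steps.

```python
import numpy as np

def conjugate_gradient(Q, b, x0, tol=1e-10, max_iter=None):
    """CG for f(x) = 0.5 x^T Q x - b^T x, i.e. for solving Qx = b with Q SPD."""
    x = x0.astype(float).copy()
    g = Q @ x - b                       # g_0 = grad f(x_0)
    d = -g                              # d_0 = -g_0
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:    # STOP if g_k = 0 (numerically)
            break
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)     # exact line search along d_k
        x = x + alpha * d
        g = Q @ x - b                   # g_{k+1}
        beta = (g @ Qd) / (d @ Qd)      # beta_k = g_{k+1}^T Q d_k / d_k^T Q d_k
        d = -g + beta * d               # d_{k+1}
    return x

# Example: CG terminates (to round-off) in at most n steps.
rng = np.random.default_rng(3)
n = 8
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)
x = conjugate_gradient(Q, b, np.zeros(n))
print(np.linalg.norm(Q @ x - b))        # ~ 1e-12
```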

The Conjugate Gradient Theorem. The C-G algorithm is a conjugate direction method. If it does not terminate at $x_k$ (that is, $g_k \ne 0$), then

  1. $\mathrm{Span}[g_0, g_1, \dots, g_k] = \mathrm{Span}[g_0, Qg_0, \dots, Q^k g_0]$
  2. $\mathrm{Span}[d_0, d_1, \dots, d_k] = \mathrm{Span}[g_0, Qg_0, \dots, Q^k g_0]$
  3. $d_k^T Q d_i = 0$ for $i \le k-1$
  4. $\alpha_k = g_k^T g_k / d_k^T Q d_k$
  5. $\beta_k = g_{k+1}^T g_{k+1} / g_k^T g_k$.

The Conjugate Gradient Theorem: Proof. Prove (1)-(3) by induction. (1)-(3) are true for $k = 0$. Suppose they are true up to $k$ and show they are true for $k+1$.

(1) $\mathrm{Span}[g_0, g_1, \dots, g_k] = \mathrm{Span}[g_0, Qg_0, \dots, Q^k g_0]$.

Since $g_{k+1} = g_k + \alpha_k Q d_k$, we have $g_{k+1} \in \mathrm{Span}[g_0, \dots, Q^{k+1} g_0]$ (induction hypothesis). Also $g_{k+1} \notin \mathrm{Span}[d_0, \dots, d_k]$, for otherwise $g_{k+1} = 0$ (by the Subspace Optimality Theorem), since the method is a conjugate direction method up to step $k$ (induction hypothesis). So $g_{k+1} \notin \mathrm{Span}[g_0, \dots, Q^k g_0]$, and

  $\mathrm{Span}[g_0, g_1, \dots, g_{k+1}] = \mathrm{Span}[g_0, \dots, Q^{k+1} g_0]$,

proving (1).

(2) $\mathrm{Span}[d_0, d_1, \dots, d_k] = \mathrm{Span}[g_0, Qg_0, \dots, Q^k g_0]$.

To prove (2), write $d_{k+1} = -g_{k+1} + \beta_k d_k$, so that (2) follows from (1) and the induction hypothesis on (2).

(3) $d_k^T Q d_i = 0$ for $i \le k-1$.

To see (3), observe that

  $d_{k+1}^T Q d_i = -g_{k+1}^T Q d_i + \beta_k d_k^T Q d_i$.

For $i = k$ the right-hand side is zero by the definition of $\beta_k$. For $i < k$ both terms vanish. The term $g_{k+1}^T Q d_i = 0$ by the Expanding Subspace Theorem, since $Q d_i \in \mathrm{Span}[d_0, \dots, d_k]$ by (1) and (2). The term $d_k^T Q d_i$ vanishes by the induction hypothesis on (3).

(4) $\alpha_k = g_k^T g_k / d_k^T Q d_k$.

Since $d_k = -g_k + \beta_{k-1} d_{k-1}$,

  $g_k^T d_k = -g_k^T g_k + \beta_{k-1} g_k^T d_{k-1}$,

where $g_k^T d_{k-1} = 0$ by the Expanding Subspace Theorem. So

  $\alpha_k = -g_k^T d_k / d_k^T Q d_k = g_k^T g_k / d_k^T Q d_k$.

(5) $\beta_k = g_{k+1}^T g_{k+1} / g_k^T g_k$.

First, $g_{k+1}^T g_k = 0$ by the Expanding Subspace Theorem, because $g_k \in \mathrm{Span}[d_0, \dots, d_k]$. Hence

  $g_{k+1}^T Q d_k = \dfrac{1}{\alpha_k} g_{k+1}^T Q(x_{k+1} - x_k) = \dfrac{1}{\alpha_k} g_{k+1}^T [g_{k+1} - g_k] = \dfrac{1}{\alpha_k} g_{k+1}^T g_{k+1}$.

Therefore,

  $\beta_k = \dfrac{g_{k+1}^T Q d_k}{d_k^T Q d_k} = \dfrac{1}{\alpha_k} \cdot \dfrac{g_{k+1}^T g_{k+1}}{d_k^T Q d_k} = \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$,

where the last equality uses (4).

Comments on the CG Algorithm. The C-G method described above is a descent method, since the values $f(x_0), f(x_1), \dots, f(x_n)$ form a decreasing sequence. Moreover, note that $\nabla f(x_k)^T d_k = -g_k^T g_k < 0$ and $\alpha_k > 0$. Thus, the C-G method behaves very much like the descent methods discussed previously.

Comments on the CG Algorithm. Due to the occurrence of round-off error, the C-G algorithm is best implemented as an iterative method. That is, at the end of $n$ steps $f$ may not attain its global minimum at $x_n$, and the intervening directions $d_k$ may not be Q-conjugate. But it is also possible for the CG algorithm to terminate early. Consequently, at each step one should check the decrease $f(x_k) - f(x_{k+1})$ and the size of the step $x_{k+1} - x_k$. If either is sufficiently small, then accept $x_{k+1}$ as the point at which $f$ attains its global minimum value; otherwise, continue to iterate regardless of the iteration count (up to a maximum acceptable number of iterations). Since CG is a descent method, continued progress is assured.
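A sketch of the termination logic described above, wrapped around a generic CG-style update (the tolerances and the `cg_step` helper are placeholders assumed for illustration, not part of the lecture notes).

```python
import numpy as np

def iterate_cg(cg_step, x0, tol_f=1e-12, tol_x=1e-12, max_iter=10_000):
    """Run a CG-style update until the decrease in f or the step size is small."""
    x, fx = x0, np.inf
    for _ in range(max_iter):
        x_new, fx_new = cg_step(x)       # one CG update: returns (x_{k+1}, f(x_{k+1}))
        if abs(fx - fx_new) <= tol_f or np.linalg.norm(x_new - x) <= tol_x:
            return x_new                 # accept the current point
        x, fx = x_new, fx_new
    return x                             # give up after max_iter iterations
```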

The Non-Quadratic CG Algorithm.

Initialization: $x_0 \in \mathbb{R}^n$, $g_0 = \nabla f(x_0)$, $d_0 = -g_0$, $0 < c < \beta < 1$.

Having $x_k$, obtain $x_{k+1}$ as follows. Check the restart criteria. If a restart condition is satisfied, then reset $x_0 := x_k$, $g_0 := \nabla f(x_0)$, $d_0 := -g_0$; otherwise, set

  $\alpha_k \in \{ \lambda > 0 : \nabla f(x_k + \lambda d_k)^T d_k \ge \beta\, \nabla f(x_k)^T d_k$ and $f(x_k + \lambda d_k) \le f(x_k) + c\lambda\, \nabla f(x_k)^T d_k \}$
  $x_{k+1} := x_k + \alpha_k d_k$
  $g_{k+1} := \nabla f(x_{k+1})$
  $\beta_k := g_{k+1}^T g_{k+1} / g_k^T g_k$  (Fletcher-Reeves)  or  $\max\{0,\ g_{k+1}^T (g_{k+1} - g_k) / g_k^T g_k\}$  (Polak-Ribiere)
  $d_{k+1} := -g_{k+1} + \beta_k d_k$
  $k := k + 1$.

The Non-Quadratic CG Algorithm: Restart Conditions

  1. $k = n$;
  2. $|g_{k+1}^T g_k| \ge 0.2\, g_k^T g_k$;
  3. the descent test $-2\, g_k^T g_k \le g_k^T d_k \le -0.2\, g_k^T g_k$ fails.
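A compact Python sketch of the non-quadratic CG iteration with both $\beta_k$ update rules and a simple bisection line search for the two conditions defining $\alpha_k$ (the line-search implementation, tolerances, and test function are assumptions made for illustration; of the restart conditions, only the periodic restart $k = n$ is implemented).

```python
import numpy as np

def wolfe_step(f, grad, x, d, c=1e-4, beta=0.9, max_iter=50):
    """Bisection search for a step satisfying the two conditions defining alpha_k."""
    f0, slope0 = f(x), grad(x) @ d       # slope0 = grad f(x_k)^T d_k < 0 for a descent direction
    lo, hi, t = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + t * d) > f0 + c * t * slope0:          # sufficient decrease fails: shrink
            hi = t
            t = 0.5 * (lo + hi)
        elif grad(x + t * d) @ d < beta * slope0:       # curvature condition fails: grow
            lo = t
            t = 2.0 * t if hi == np.inf else 0.5 * (lo + hi)
        else:
            return t
    return t

def nonlinear_cg(f, grad, x0, variant="PR", tol=1e-8, max_iter=5000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    n = x.size
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        alpha = wolfe_step(f, grad, x, d)
        x_new = x + alpha * d
        g_new = grad(x_new)
        if variant == "FR":
            beta_k = (g_new @ g_new) / (g @ g)                    # Fletcher-Reeves
        else:
            beta_k = max(0.0, g_new @ (g_new - g) / (g @ g))      # Polak-Ribiere with max{0, .}
        restart = (k + 1) % n == 0                                # restart condition 1: k = n
        d = -g_new if restart else -g_new + beta_k * d
        x, g = x_new, g_new
    return x

# Example on the Rosenbrock function.
rosen = lambda z: (1 - z[0]) ** 2 + 100 * (z[1] - z[0] ** 2) ** 2
rosen_grad = lambda z: np.array([-2 * (1 - z[0]) - 400 * z[0] * (z[1] - z[0] ** 2),
                                 200 * (z[1] - z[0] ** 2)])
print(nonlinear_cg(rosen, rosen_grad, np.array([-1.2, 1.0])))     # approx [1, 1]
```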

The Non-Quadratic CG Algorithm. The Polak-Ribiere update for $\beta_k$ has a demonstrated experimental superiority. One way to see why this might be true is to observe that

  $g_{k+1}^T (g_{k+1} - g_k) \approx \alpha_k\, g_{k+1}^T \nabla^2 f(x_k) d_k$,

thereby yielding a better second-order approximation. Indeed, the formula for $\beta_k$ in the quadratic case is precisely

  $\beta_k = \dfrac{\alpha_k\, g_{k+1}^T \nabla^2 f(x_k) d_k}{g_k^T g_k}$.
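A small numerical check of the approximation $g_{k+1}^T(g_{k+1} - g_k) \approx \alpha_k\, g_{k+1}^T \nabla^2 f(x_k) d_k$ on a non-quadratic function (the function, point, direction, and step length are arbitrary choices made here for illustration).

```python
import numpy as np

# f(x) = sum(exp(x_i) + x_i^4): smooth, non-quadratic, with simple derivatives.
grad = lambda x: np.exp(x) + 4 * x ** 3
hess = lambda x: np.diag(np.exp(x) + 12 * x ** 2)

rng = np.random.default_rng(4)
x_k = rng.standard_normal(4)
d_k = -grad(x_k)                       # a descent direction
alpha_k = 1e-3                         # a small step so the first-order expansion of the gradient is accurate

g_k = grad(x_k)
g_k1 = grad(x_k + alpha_k * d_k)       # g_{k+1}

lhs = g_k1 @ (g_k1 - g_k)
rhs = alpha_k * (g_k1 @ hess(x_k) @ d_k)
print(lhs, rhs)                        # nearly equal, since g_{k+1} - g_k ≈ alpha_k * hess(x_k) d_k
```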
