R-Linear Convergence of Limited Memory Steepest Descent
R-Linear Convergence of Limited Memory Steepest Descent
Frank E. Curtis, Lehigh University
joint work with Wei Guo, Lehigh University
OP17, Vancouver, British Columbia, Canada, 24 May 2017
R-Linear Convergence of Limited Memory Steepest Descent 1 of 26
Outline
Introduction
Limited Memory Steepest Descent (LMSD)
R-Linear Convergence of LMSD
Numerical Demonstrations
Summary
Unconstrained optimization: Steepest descent
Consider the unconstrained optimization problem
\min_{x \in \mathbb{R}^n} f(x), where f : \mathbb{R}^n \to \mathbb{R} is C^1.
Let us focus exclusively on a steepest descent framework:
Algorithm SD (Steepest Descent)
Require: x_1 \in \mathbb{R}^n
1: for k \in \mathbb{N} do
2:   Compute g_k \gets \nabla f(x_k)
3:   Choose \alpha_k \in (0, \infty)
4:   Set x_{k+1} \gets x_k - \alpha_k g_k
5: end for
All that remains to be determined are the stepsizes \{\alpha_k\}.
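As a concrete illustration, Algorithm SD can be sketched for a strongly convex quadratic; the matrix, right-hand side, and constant stepsize below are illustrative choices, not part of the algorithm's statement.

```python
import numpy as np

# Minimal sketch of Algorithm SD on the strongly convex quadratic
# f(x) = 0.5 x^T A x - b^T x, whose gradient is g(x) = A x - b.
# The constant stepsize alpha = 1/lambda_n is one illustrative choice.
A = np.diag([1.0, 4.0, 10.0])        # eigenvalues lambda_1 <= ... <= lambda_n
b = np.array([1.0, 1.0, 1.0])
x = np.zeros(3)                      # x_1
alpha = 1.0 / 10.0                   # 1/lambda_n

for _ in range(200):
    g = A @ x - b                    # g_k = grad f(x_k)
    x = x - alpha * g                # x_{k+1} = x_k - alpha_k g_k

converged = np.allclose(x, np.linalg.solve(A, b), atol=1e-8)
print(converged)
```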
Minimizing strongly convex quadratics
Suppose f(x) = \tfrac{1}{2} x^T A x - b^T x, where A has eigenvalues \lambda_1 \le \cdots \le \lambda_n.
[Figure: the real line with 0, 1/\lambda_n, and 1/\lambda_1 marked, illustrating the range of stepsizes.]
Convergence (rate) of the algorithm depends on the choices for \{\alpha_k\}.
Choosing \alpha_k \equiv 1/\lambda_n leads to Q-linear convergence with constant (1 - \lambda_1/\lambda_n)...
... but certain components of the gradient vanish for stepsizes in a larger range.
Goal: Allow large stepsizes, and shrink the range (automatically) to catch the entire gradient.
Contributions
Consider Fletcher's limited memory steepest descent (LMSD) method.
It extends the Barzilai-Borwein (BB) two-point stepsize strategy.
BB methods are known to have an R-linear convergence rate; Dai and Liao (2002).
We prove that LMSD also attains R-linear convergence.
Although the proved convergence rate is not necessarily better than that for BB, one can see reasons for LMSD's improved empirical performance.
Decomposition
\min_{x \in \mathbb{R}^n} f(x), where f(x) = \tfrac{1}{2} x^T A x - b^T x.
Let A have the eigendecomposition A = Q \Lambda Q^T, where Q = [q_1 \cdots q_n] is orthogonal and \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n) with \lambda_n \ge \cdots \ge \lambda_1 > 0.
Let g := \nabla f. For any x \in \mathbb{R}^n, the gradient of f at x can be expressed as
g(x) = \sum_{i=1}^n d_i q_i, where d_i \in \mathbb{R} for all i \in [n] := \{1, \ldots, n\}.
Recursion
Let g := \nabla f. For any x \in \mathbb{R}^n, the gradient of f at x can be expressed as
g(x) = \sum_{i=1}^n d_i q_i, where d_i \in \mathbb{R} for all i \in [n] := \{1, \ldots, n\}. (1)
If x^+ \gets x - \alpha g(x), then the weights satisfy the recursive property
d_i^+ = (1 - \alpha \lambda_i) d_i for all i \in [n].
Proof (Sketch). Since g(x) = Ax - b,
x^+ = x - \alpha g(x) \implies Ax^+ = Ax - \alpha A g(x) \implies g(x^+) = (I - \alpha A) g(x) = (I - \alpha Q \Lambda Q^T) g(x),
then decompose according to (1).
Idea: Choose stepsizes as reciprocals of (estimates of) eigenvalues of A.
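The recursion can be verified numerically by decomposing the gradient in the eigenbasis before and after one step; the matrix, right-hand side, starting point, and stepsize below are illustrative.

```python
import numpy as np

# Check of the recursion d_i^+ = (1 - alpha lambda_i) d_i: decompose the
# gradient in the eigenbasis of A before and after one steepest-descent
# step. A, b, x, and alpha below are illustrative.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)        # symmetric positive definite
b = rng.standard_normal(5)
x = rng.standard_normal(5)
alpha = 0.1

lam, Q = np.linalg.eigh(A)           # A = Q diag(lam) Q^T
d = Q.T @ (A @ x - b)                # weights d_i with g(x) = sum_i d_i q_i

x_plus = x - alpha * (A @ x - b)     # x^+ = x - alpha g(x)
d_plus = Q.T @ (A @ x_plus - b)      # new weights d_i^+

match = np.allclose(d_plus, (1.0 - alpha * lam) * d)
print(match)
```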
LMSD method: Main idea
Fletcher (2012): Repeated cycles (or "sweeps") of m iterations.
At the start of the (k+1)st cycle, suppose one has the kth-cycle values in G_k := [g_{k,1} \cdots g_{k,m}], corresponding to \{x_{k,1}, \ldots, x_{k,m}\}.
The iterate displacements lie in the Krylov sequence initiated from g_{k,1}.
Performing a QR decomposition to obtain G_k = Q_k R_k, one obtains m eigenvalue estimates (Ritz values) as the eigenvalues of the (symmetric tridiagonal) matrix T_k \gets Q_k^T A Q_k, which are contained in the spectrum of A in an optimal sense (more later).
One can also obtain these estimates more cheaply and with less storage...
LMSD method: Efficient eigenvalue estimation
Storing the kth-cycle reciprocal stepsizes in the (m+1) \times m bidiagonal matrix
J_k := \begin{bmatrix} \alpha_{k,1}^{-1} & & \\ -\alpha_{k,1}^{-1} & \ddots & \\ & \ddots & \alpha_{k,m}^{-1} \\ & & -\alpha_{k,m}^{-1} \end{bmatrix},
one finds that by computing the (partially extended) Cholesky factorization
G_k^T [G_k \; g_{k,m+1}] = R_k^T [R_k \; r_k],
one has T_k \gets [R_k \; r_k] J_k R_k^{-1}.
Long story short: One can obtain the Ritz values (and stepsizes) in \tfrac{1}{2} m^2 n flops... and this is done only once every m steps.
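A small numerical check of this identity, under the reading that J_k is the bidiagonal matrix of reciprocal stepsizes: for a quadratic, A g_j = (g_j - g_{j+1})/\alpha_j, so A G_k = [G_k \; g_{k,m+1}] J_k and hence Q_k^T A Q_k = [R_k \; r_k] J_k R_k^{-1}. For simplicity, R_k and r_k are obtained here from a QR factorization of G_k rather than from the Cholesky factorization; all problem data are illustrative.

```python
import numpy as np

# Sketch of the identity T_k = [R_k r_k] J_k R_k^{-1}. For a quadratic,
# A g_j = (g_j - g_{j+1}) / alpha_j, so A G = [G g_{m+1}] J with J the
# bidiagonal matrix of reciprocal stepsizes, and hence
# Q^T A Q = [R r] J R^{-1}. Here R and r come from a QR factorization of
# G for simplicity; all problem data are illustrative.
rng = np.random.default_rng(1)
n, m = 40, 4
A = np.diag(np.linspace(1.0, 60.0, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)
alphas = np.array([0.02, 0.05, 0.1, 0.3])

G = np.zeros((n, m))
for j in range(m):
    g = A @ x - b
    G[:, j] = g
    x = x - alphas[j] * g
g_next = A @ x - b                   # g_{m+1}

J = np.zeros((m + 1, m))             # 1/alpha_j on the diagonal,
for j in range(m):                   # -1/alpha_j just below it
    J[j, j] = 1.0 / alphas[j]
    J[j + 1, j] = -1.0 / alphas[j]

Q, R = np.linalg.qr(G)
r = Q.T @ g_next
T_cheap = np.hstack([R, r[:, None]]) @ J @ np.linalg.inv(R)
T_direct = Q.T @ A @ Q               # reference: form T directly

same = np.allclose(T_cheap, T_direct, atol=1e-6)
print(same)
```

The point of the cheap formula is that it needs only inner products of stored gradients, never a product with A or the explicit factor Q_k.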
Algorithm LMSD (Limited Memory Steepest Descent)
Require: x_{1,1} \in \mathbb{R}^n, m \in \mathbb{N}, and \epsilon \in \mathbb{R}_+
1: Choose stepsizes \{\alpha_{1,j}\}_{j \in [m]} \subset \mathbb{R}_{++}
2: Compute g_{1,1} \gets \nabla f(x_{1,1})
3: if \|g_{1,1}\| \le \epsilon, then return x_{1,1}
4: for k \in \mathbb{N} do
5:   for j \in [m] do
6:     Set x_{k,j+1} \gets x_{k,j} - \alpha_{k,j} g_{k,j}
7:     Compute g_{k,j+1} \gets \nabla f(x_{k,j+1})
8:     if \|g_{k,j+1}\| \le \epsilon, then return x_{k,j+1}
9:   end for
10:  Set x_{k+1,1} \gets x_{k,m+1} and g_{k+1,1} \gets g_{k,m+1}
11:  Set G_k and J_k
12:  Compute (R_k, r_k), then compute T_k
13:  Compute \{\theta_{k,j}\}_{j \in [m]} \subset \mathbb{R}_{++} as the eigenvalues of T_k
14:  Set \{\alpha_{k+1,j}\}_{j \in [m]} \gets \{\theta_{k,j}^{-1}\}_{j \in [m]} \subset \mathbb{R}_{++}
15: end for
(Note: There is also a version using harmonic Ritz values.)
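The whole algorithm can be sketched for the quadratic case. For clarity, this version forms T_k directly as Q_k^T A Q_k from a QR factorization of the gradient matrix, rather than via the cheaper Cholesky-based recovery; the problem data, initial stepsizes, and iteration limit are illustrative.

```python
import numpy as np

# Compact LMSD sketch for f(x) = 0.5 x^T A x - b^T x, following the
# slide's structure: each cycle takes m steps with the stepsizes from the
# previous cycle, then computes m new Ritz values. T_k is formed directly
# as Q_k^T A Q_k (rather than via the cheaper Cholesky-based recovery).
def lmsd(A, b, x, m=5, eps=1e-8, max_cycles=500):
    n = len(x)
    alphas = np.full(m, 1.0 / np.linalg.norm(A, 2))   # initial stepsizes
    for _ in range(max_cycles):
        G = np.zeros((n, m))
        for j in range(m):
            g = A @ x - b
            if np.linalg.norm(g) <= eps:
                return x
            G[:, j] = g
            x = x - alphas[j] * g
        Q, _ = np.linalg.qr(G)
        theta = np.linalg.eigvalsh(Q.T @ A @ Q)       # Ritz values
        alphas = 1.0 / theta                          # next cycle's stepsizes
    return x

rng = np.random.default_rng(2)
n = 30
A = np.diag(np.linspace(1.0, 50.0, n))
b = rng.standard_normal(n)
x_final = lmsd(A, b, rng.standard_normal(n))
resid = np.linalg.norm(A @ x_final - b)
print(resid <= 1e-8)
```

In exact arithmetic the Ritz values lie in [\lambda_1, \lambda_n], so the stepsizes stay positive; a production implementation would also guard against a (nearly) rank-deficient G_k.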
Known convergence properties
BB methods (m = 1):
R-superlinear when n = 2; Barzilai and Borwein (1988).
Convergent for any n from any starting point; Raydan (1993).
R-linear for any n; Dai and Liao (2002).
LMSD methods (m \ge 1):
Convergent for any n from any starting point; Fletcher (2012).
Prior to our work: the convergence rate had not yet been analyzed.
Basic assumptions
Assumption 1.
(i) Algorithm LMSD is run with \epsilon = 0 and g_{k,j} \neq 0 for all (k, j) \in \mathbb{N} \times [m].
(ii) For all k \in \mathbb{N}, the matrix G_k has linearly independent columns. Further, there exists \rho \in [1, \infty) such that, for all k \in \mathbb{N},
\|R_k^{-1}\| \le \rho \|g_{k,1}\|^{-1}. (2)
To justify (2), note that when m = 1, one has Q_k R_k = G_k = g_{k,1}, where Q_k = g_{k,1}/\|g_{k,1}\| and R_k = \|g_{k,1}\|. Hence, (2) holds with \rho = 1.
Intuition
Lemma 2. For all k \in \mathbb{N}, the eigenvalues of T_k satisfy
\theta_{k,j} \in [\lambda_{m+1-j}, \lambda_{n+1-j}] \subseteq [\lambda_1, \lambda_n] for all j \in [m].
[Figure: recall the real line with 0, 1/\lambda_n, and 1/\lambda_1 marked.]
We essentially prove that...
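Lemma 2's containment can be spot-checked numerically, under the reading that the Ritz values are indexed so that \theta_{k,1} is the largest; the spectrum, stepsize, and cycle length below are illustrative.

```python
import numpy as np

# Spot-check of Lemma 2 (a Poincare-separation-type bound): with
# G_k = Q_k R_k and T_k = Q_k^T A Q_k, the Ritz values sorted so that
# theta_{k,1} is the largest satisfy
# theta_{k,j} in [lambda_{m+1-j}, lambda_{n+1-j}]. Data are illustrative.
rng = np.random.default_rng(5)
n, m = 30, 4
lam = np.linspace(1.0, 20.0, n)      # lambda_1 <= ... <= lambda_n
A = np.diag(lam)
b = rng.standard_normal(n)
x = rng.standard_normal(n)

G = np.zeros((n, m))
for j in range(m):
    g = A @ x - b
    G[:, j] = g
    x = x - 0.04 * g

Q, _ = np.linalg.qr(G)
theta = np.sort(np.linalg.eigvalsh(Q.T @ A @ Q))[::-1]   # descending

tol = 1e-8
bounds_hold = all(lam[m - j] - tol <= theta[j - 1] <= lam[n - j] + tol
                  for j in range(1, m + 1))
print(bounds_hold)
```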
Worst-case blow-up of weights over a cycle
Lemma 3. For each (k, j, i) \in \mathbb{N} \times [m] \times [n]:
|d_{k,j+1,i}| \le \delta_{j,i} |d_{k,j,i}|, where \delta_{j,i} := \max\left\{ \left|1 - \tfrac{\lambda_i}{\lambda_{m+1-j}}\right|, \left|1 - \tfrac{\lambda_i}{\lambda_{n+1-j}}\right| \right\}.
Hence, for each (k, j, i) \in \mathbb{N} \times [m] \times [n]:
|d_{k+1,j,i}| \le \Delta_i |d_{k,j,i}|, where \Delta_i := \prod_{j=1}^m \delta_{j,i}.
Furthermore, for each (k, j, p) \in \mathbb{N} \times [m] \times [n]:
\sum_{i=1}^p d_{k,j+1,i}^2 \le \hat\delta_{j,p}^2 \sum_{i=1}^p d_{k,j,i}^2, where \hat\delta_{j,p} := \max_{i \in [p]} \delta_{j,i},
while, for each (k, j) \in \mathbb{N} \times [m]:
\|g_{k+1,j}\| \le \Delta \|g_{k,j}\|, where \Delta := \max_{i \in [n]} \Delta_i.
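The per-step factor \delta_{j,i} can be spot-checked: with stepsize \alpha = 1/\theta and \theta anywhere in Lemma 2's interval, the update factor |1 - \alpha\lambda_i| = |1 - \lambda_i/\theta| is monotone in \theta and hence maximized at an endpoint. The spectrum and cycle length below are illustrative.

```python
import numpy as np

# Spot-check of the factor delta_{j,i} in Lemma 3: for a stepsize
# alpha = 1/theta with theta in [lambda_{m+1-j}, lambda_{n+1-j}], the
# weight update factor |1 - alpha lambda_i| is bounded by delta_{j,i},
# the maximum of |1 - lambda_i/theta| over the interval's endpoints.
rng = np.random.default_rng(4)
lam = np.sort(rng.uniform(1.0, 100.0, size=10))   # lambda_1 <= ... <= lambda_n
n, m = len(lam), 3

bound_holds = True
for j in range(1, m + 1):                  # step j of a cycle (1-based)
    lo, hi = lam[m - j], lam[n - j]        # lambda_{m+1-j}, lambda_{n+1-j}
    delta = np.maximum(np.abs(1.0 - lam / lo), np.abs(1.0 - lam / hi))
    for theta in np.linspace(lo, hi, 25):  # any Ritz value in the interval
        bound_holds &= bool(np.all(np.abs(1.0 - lam / theta) <= delta + 1e-12))
print(bound_holds)
```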
Q-linear convergence of weight i = 1
Lemma 4. If \Delta_1 = 0, then d_{1+\hat k, \hat\jmath, 1} = 0 for all (\hat k, \hat\jmath) \in \mathbb{N} \times [m]. Otherwise, if \Delta_1 > 0, then:
(i) for (k, j) \in \mathbb{N} \times [m] with d_{k,j,1} = 0, it follows that d_{k+\hat k, \hat\jmath, 1} = 0 for all (\hat k, \hat\jmath) \in \mathbb{N} \times [m];
(ii) for (k, j) \in \mathbb{N} \times [m] with |d_{k,j,1}| > 0 and any \epsilon_1 \in (0, 1), it follows that
|d_{k+\hat k, \hat\jmath, 1}| \le \epsilon_1 |d_{k,j,1}| for all \hat k \ge 1 + \tfrac{\log \epsilon_1}{\log \Delta_1} and \hat\jmath \in [m].
Ritz value representation
Lemma 5. For all (k, j) \in \mathbb{N} \times [m], let q_{k,j} \in \mathbb{R}^m denote the unit eigenvector corresponding to the eigenvalue \theta_{k,j} of T_k, i.e., that with T_k q_{k,j} = \theta_{k,j} q_{k,j} and \|q_{k,j}\| = 1. Then, defining
D_k := \begin{bmatrix} d_{k,1,1} & \cdots & d_{k,m,1} \\ \vdots & & \vdots \\ d_{k,1,n} & \cdots & d_{k,m,n} \end{bmatrix} and c_{k,j} := D_k R_k^{-1} q_{k,j},
it follows that, with the diagonal matrix of eigenvalues (namely, \Lambda = Q^T A Q),
\theta_{k,j} = c_{k,j}^T \Lambda c_{k,j} and c_{k,j}^T c_{k,j} = 1.
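Lemma 5's representation can be verified numerically. Working in the eigenbasis of A (A diagonal, Q = I), the weight matrix D_k coincides with the gradient matrix G_k; the spectrum, stepsize, and cycle length below are illustrative.

```python
import numpy as np

# Check of Lemma 5: theta_{k,j} = c^T Lambda c and c^T c = 1 for
# c = D_k R_k^{-1} q_{k,j}. In the eigenbasis of A (A diagonal, Q = I),
# the weight matrix D_k is just the gradient matrix G.
rng = np.random.default_rng(3)
n, m = 20, 4
lam = np.linspace(2.0, 30.0, n)
A = np.diag(lam)
b = rng.standard_normal(n)
x = rng.standard_normal(n)

G = np.zeros((n, m))
for j in range(m):
    g = A @ x - b
    G[:, j] = g                      # columns hold the weights d_{k,j,i}
    x = x - 0.03 * g

Qk, R = np.linalg.qr(G)
theta, V = np.linalg.eigh(Qk.T @ A @ Qk)   # eigenpairs of T_k

lemma_holds = True
for j in range(m):
    c = G @ np.linalg.solve(R, V[:, j])    # c = D R^{-1} q
    lemma_holds &= bool(np.isclose(c @ c, 1.0))
    lemma_holds &= bool(np.isclose(lam @ c**2, theta[j]))
print(lemma_holds)
```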
If the first p weights are small, then a bound holds for weight p + 1
(We express \hat\delta_p \in [1, \infty) dependent only on m, p, and the spectrum of A.)
Lemma 6 (simplified). For any (k, p) \in \mathbb{N} \times [n-1], if there exists (\epsilon_p, K_p) \in (0, \tfrac{1}{2 \hat\delta_p \rho}) \times \mathbb{N} with
\sum_{i=1}^p d_{k+\hat k,1,i}^2 \le \epsilon_p^2 \|g_{k,1}\|^2 for all \hat k \ge K_p,
then there exists K_{p+1} \ge K_p dependent only on \epsilon_p, \rho, and the spectrum of A with
d_{k+K_{p+1},1,p+1}^2 \le 4 \hat\delta_p^2 \rho^2 \epsilon_p^2 \|g_{k,1}\|^2.
Proof (Key step). The first p elements of c_{k+\hat k, j} are small enough such that
\theta_{k+\hat k, j} = \sum_{i=1}^n \lambda_i c_{k+\hat k, j, i}^2 \ge \tfrac{3}{4} \lambda_{p+1} for all \hat k \ge K_p and j \in [m].
If the first p weights are small, then a bound holds for all of the first p + 1 weights...
Lemma 7. For any (k, p) \in \mathbb{N} \times [n-1], if there exists (\epsilon_p, K_p) \in (0, \tfrac{1}{2 \hat\delta_p \rho}) \times \mathbb{N} with
\sum_{i=1}^p d_{k+\hat k,1,i}^2 \le \epsilon_p^2 \|g_{k,1}\|^2 for all \hat k \ge K_p,
then, with \epsilon_{p+1}^2 := (1 + 4 \max\{1, \Delta_{p+1}^4\} \hat\delta_p^2 \rho^2) \epsilon_p^2 and some K_{p+1} \in \mathbb{N},
\sum_{i=1}^{p+1} d_{k+\hat k,1,i}^2 \le \epsilon_{p+1}^2 \|g_{k,1}\|^2 for all \hat k \ge K_{p+1}.
R-linear convergence of LMSD
Lemma 8. There exists K \in \mathbb{N} dependent only on the spectrum of A such that
\|g_{k+K,1}\| \le \tfrac{1}{2} \|g_{k,1}\| for all k \in \mathbb{N}.
Theorem 9. The sequence \{\|g_{k,1}\|\} vanishes R-linearly in the sense that
\|g_{k,1}\| \le c_1 c_2^k \|g_{1,1}\|, where c_1 := 2^{K-1} and c_2 := 2^{-1/K} \in (0, 1).
Numerical demonstrations with n = 100: m = 1 and m = 5
Figure: \{\lambda_1, \ldots, \lambda_{100}\} \subset [1, 1.9].
Figure: \{\lambda_1, \ldots, \lambda_{100}\} \subset [1, 100].
Figure: \{\lambda_1, \ldots, \lambda_{100}\} in 5 clusters; m = 5.
Figure: \{\lambda_1, \ldots, \lambda_{100}\} in 2 clusters (low heavy); m = 5.
Figure: \{\lambda_1, \ldots, \lambda_{100}\} in 2 clusters (high heavy); m = 5.
Summary
Consider Fletcher's limited memory steepest descent (LMSD) method.
It extends the Barzilai-Borwein (BB) two-point stepsize strategy.
BB methods are known to have an R-linear convergence rate; Dai and Liao (2002).
We prove that LMSD also attains R-linear convergence.
Although the proved convergence rate is not necessarily better than that for BB, one can see reasons for LMSD's improved empirical performance.
F. E. Curtis and W. Guo. R-Linear Convergence of Limited Memory Steepest Descent. Technical Report 16T-010, COR@L Laboratory, Department of ISE, Lehigh University. To appear in IMA Journal of Numerical Analysis.
More informationLecture 2 INF-MAT : , LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky
Lecture 2 INF-MAT 4350 2009: 7.1-7.6, LU, symmetric LU, Positve (semi)definite, Cholesky, Semi-Cholesky Tom Lyche and Michael Floater Centre of Mathematics for Applications, Department of Informatics,
More informationReduced-Hessian Methods for Constrained Optimization
Reduced-Hessian Methods for Constrained Optimization Philip E. Gill University of California, San Diego Joint work with: Michael Ferry & Elizabeth Wong 11th US & Mexico Workshop on Optimization and its
More information6.4 Krylov Subspaces and Conjugate Gradients
6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P
More informationTMA4180 Solutions to recommended exercises in Chapter 3 of N&W
TMA480 Solutions to recommended exercises in Chapter 3 of N&W Exercise 3. The steepest descent and Newtons method with the bactracing algorithm is implemented in rosenbroc_newton.m. With initial point
More informationIterative methods for Linear System
Iterative methods for Linear System JASS 2009 Student: Rishi Patil Advisor: Prof. Thomas Huckle Outline Basics: Matrices and their properties Eigenvalues, Condition Number Iterative Methods Direct and
More informationNew Inexact Line Search Method for Unconstrained Optimization 1,2
journal of optimization theory and applications: Vol. 127, No. 2, pp. 425 446, November 2005 ( 2005) DOI: 10.1007/s10957-005-6553-6 New Inexact Line Search Method for Unconstrained Optimization 1,2 Z.
More informationA derivative-free nonmonotone line search and its application to the spectral residual method
IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral
More informationHow to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization
How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization Frank E. Curtis Department of Industrial and Systems Engineering, Lehigh University Daniel P. Robinson Department
More informationCME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.
CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax
More informationThe Q Method for Symmetric Cone Programmin
The Q Method for Symmetric Cone Programming The Q Method for Symmetric Cone Programmin Farid Alizadeh and Yu Xia alizadeh@rutcor.rutgers.edu, xiay@optlab.mcma Large Scale Nonlinear and Semidefinite Progra
More information5 Quasi-Newton Methods
Unconstrained Convex Optimization 26 5 Quasi-Newton Methods If the Hessian is unavailable... Notation: H = Hessian matrix. B is the approximation of H. C is the approximation of H 1. Problem: Solve min
More informationOptimization Methods. Lecture 18: Optimality Conditions and. Gradient Methods. for Unconstrained Optimization
5.93 Optimization Methods Lecture 8: Optimality Conditions and Gradient Methods for Unconstrained Optimization Outline. Necessary and sucient optimality conditions Slide. Gradient m e t h o d s 3. The
More informationNumerical methods part 2
Numerical methods part 2 Alain Hébert alain.hebert@polymtl.ca Institut de génie nucléaire École Polytechnique de Montréal ENE6103: Week 6 Numerical methods part 2 1/33 Content (week 6) 1 Solution of an
More informationMidterm for Introduction to Numerical Analysis I, AMSC/CMSC 466, on 10/29/2015
Midterm for Introduction to Numerical Analysis I, AMSC/CMSC 466, on 10/29/2015 The test lasts 1 hour and 15 minutes. No documents are allowed. The use of a calculator, cell phone or other equivalent electronic
More informationLine Search Methods for Unconstrained Optimisation
Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic
More informationj=1 u 1jv 1j. 1/ 2 Lemma 1. An orthogonal set of vectors must be linearly independent.
Lecture Notes: Orthogonal and Symmetric Matrices Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk Orthogonal Matrix Definition. Let u = [u
More informationKrylov subspace projection methods
I.1.(a) Krylov subspace projection methods Orthogonal projection technique : framework Let A be an n n complex matrix and K be an m-dimensional subspace of C n. An orthogonal projection technique seeks
More informationThis manuscript is for review purposes only.
1 2 3 4 5 6 7 8 9 10 11 12 THE USE OF QUADRATIC REGULARIZATION WITH A CUBIC DESCENT CONDITION FOR UNCONSTRAINED OPTIMIZATION E. G. BIRGIN AND J. M. MARTíNEZ Abstract. Cubic-regularization and trust-region
More informationSymmetric matrices and dot products
Symmetric matrices and dot products Proposition An n n matrix A is symmetric iff, for all x, y in R n, (Ax) y = x (Ay). Proof. If A is symmetric, then (Ax) y = x T A T y = x T Ay = x (Ay). If equality
More information1 Conjugate gradients
Notes for 2016-11-18 1 Conjugate gradients We now turn to the method of conjugate gradients (CG), perhaps the best known of the Krylov subspace solvers. The CG iteration can be characterized as the iteration
More informationUnconstrained optimization
Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout
More informationConjugate Gradient Method
Conjugate Gradient Method Tsung-Ming Huang Department of Mathematics National Taiwan Normal University October 10, 2011 T.M. Huang (NTNU) Conjugate Gradient Method October 10, 2011 1 / 36 Outline 1 Steepest
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More information1. Introduction. We develop an active set method for the box constrained optimization
SIAM J. OPTIM. Vol. 17, No. 2, pp. 526 557 c 2006 Society for Industrial and Applied Mathematics A NEW ACTIVE SET ALGORITHM FOR BOX CONSTRAINED OPTIMIZATION WILLIAM W. HAGER AND HONGCHAO ZHANG Abstract.
More informationA globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications
A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth
More informationAdaptive two-point stepsize gradient algorithm
Numerical Algorithms 27: 377 385, 2001. 2001 Kluwer Academic Publishers. Printed in the Netherlands. Adaptive two-point stepsize gradient algorithm Yu-Hong Dai and Hongchao Zhang State Key Laboratory of
More information