Error Bounds for Iterative Refinement in Three Precisions


1 Error Bounds for Iterative Refinement in Three Precisions. Erin C. Carson, New York University; Nicholas J. Higham, University of Manchester. SIAM Annual Meeting, Portland, Oregon, July 13, 2018

2 Hardware Support for Multiprecision Computation. Use of low precision in machine learning has driven the emergence of low-precision capabilities in hardware:
Half precision (FP16) defined as a storage format in the 2008 IEEE standard.
ARM NEON: SIMD architecture, instructions for 8x16-bit, 4x32-bit, 2x64-bit operations.
AMD Radeon Instinct MI25 GPU, 2017: single: 12.3 TFLOPS, half: 24.6 TFLOPS.
NVIDIA Tesla P100, 2016: native ISA support for 16-bit FP arithmetic.
NVIDIA Tesla V100, 2017: tensor cores for half precision; 4x4 matrix multiply in one clock cycle; double: 7 TFLOPS, half+tensor: 112 TFLOPS (16x!).
Google's Tensor Processing Unit (TPU): quantizes 32-bit FP computations into 8-bit integer arithmetic.
Aurora exascale supercomputer (2021): expected extensive support for reduced-precision arithmetic (32/16/8-bit).

3 Iterative Refinement for Ax = b. Iterative refinement: a well-established method for improving an approximate solution to Ax = b. A is n x n and nonsingular; u is the unit roundoff.
Solve Ax_0 = b by LU factorization
for i = 0 : maxit
    r_i = b - Ax_i
    Solve Ad_i = r_i via d_i = U^{-1}(L^{-1} r_i)
    x_{i+1} = x_i + d_i

4 Iterative Refinement for Ax = b (continued). The same method, with the precisions made explicit:
Solve Ax_0 = b by LU factorization (in precision u)
for i = 0 : maxit
    r_i = b - Ax_i (in precision u^2)
    Solve Ad_i = r_i via d_i = U^{-1}(L^{-1} r_i) (in precision u)
    x_{i+1} = x_i + d_i (in precision u)
"Traditional" (high-precision residual computation): [Wilkinson, 1948] (fixed point), [Moler, 1967] (floating point)

5 Iterative Refinement for Ax = b (continued).
Solve Ax_0 = b by LU factorization (in precision u)
for i = 0 : maxit
    r_i = b - Ax_i (in precision u)
    Solve Ad_i = r_i via d_i = U^{-1}(L^{-1} r_i) (in precision u)
    x_{i+1} = x_i + d_i (in precision u)
"Fixed-Precision": [Jankowski and Woźniakowski, 1977], [Skeel, 1980], [Higham, 1991]

6 Iterative Refinement for Ax = b (continued).
Solve Ax_0 = b by LU factorization (in precision u^{1/2})
for i = 0 : maxit
    r_i = b - Ax_i (in precision u)
    Solve Ad_i = r_i via d_i = U^{-1}(L^{-1} r_i) (in precision u)
    x_{i+1} = x_i + d_i (in precision u)
"Low-precision factorization": [Langou et al., 2006], [Arioli and Duff, 2009], [Hogg and Scott, 2010], [Abdelfattah et al., 2016]
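A minimal MATLAB sketch of this low-precision-factorization variant, with u_f = single and u = u_r = double; the test matrix, stopping test, and iteration cap are illustrative assumptions, not taken from the talk:

    % LU-IR sketch: factor once in single (u_f), refine in double (u = u_r)
    n = 100;
    A = gallery('randsvd', n, 1e6, 2);       % assumed test matrix
    b = randn(n, 1);
    [L, U, p] = lu(single(A), 'vector');     % LU factorization in precision u_f
    x = double(U \ (L \ single(b(p))));      % initial solve with the single factors
    for i = 1:10
        r = b - A*x;                         % residual in working precision u
        d = double(U \ (L \ single(r(p))));  % correction solve reuses single factors
        x = x + d;                           % update in working precision u
        if norm(d, inf) <= 2*eps*norm(x, inf), break, end
    end

For kappa(A)*u_f safely below 1 this converges at single-precision factorization cost; as kappa(A)*u_f approaches 1 it can stagnate or diverge, which is what the analysis below quantifies.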


10 Iterative Refinement in 3 Precisions. Existing analyses only support at most two precisions. Can we combine the performance benefits of low-precision-factorization IR with the accuracy of traditional IR?
3-precision iterative refinement [C. and Higham, SIAM SISC 40(2), 2018]: u_f = factorization precision, u = working precision, u_r = residual precision, with u_f ≥ u ≥ u_r.
The new analysis generalizes the existing types of IR (and improves upon existing analyses in some cases):
Traditional: u_f = u, u_r = u^2
Fixed precision: u_f = u = u_r
Lower precision factorization: u_f^2 = u = u_r
It also enables new types of IR: (half, single, double), (half, single, quad), (half, double, quad), etc.
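For reference, the unit roundoffs of the four precisions named above; a small MATLAB sketch (half and quad are not native MATLAB types, so those two values are entered by hand from the IEEE 754 formats):

    uh = 2^-11;             % half (fp16),   ~4.9e-4
    us = eps('single')/2;   % single (fp32), ~6.0e-8
    ud = eps('double')/2;   % double (fp64), ~1.1e-16
    uq = 2^-113;            % quad (fp128),  ~9.6e-35
    % e.g., (half, single, double) means u_f = uh, u = us, u_r = ud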


14 Key Analysis Innovations I. Obtain tighter upper bounds. The typical bound used in the analysis is ‖A(x − x_i)‖ ≤ ‖A‖ ‖x − x_i‖. Instead, define μ_i by ‖A(x − x_i)‖ = μ_i ‖A‖ ‖x − x_i‖.
For a stable refinement scheme, in the early stages we expect ‖r_i‖ ≈ ‖A‖ ‖x − x_i‖, i.e., μ_i ≈ 1. But close to convergence, ‖r_i‖ ≪ ‖A‖ ‖x − x_i‖, i.e., μ_i ≪ 1.


20 Key Analysis Innovations I (continued). Work in the 2-norm with the SVD A = UΣV^T and write ‖r_i‖_2 = μ_i^{(2)} ‖A‖_2 ‖x − x_i‖_2. Then

$$x - x_i = V\Sigma^{-1}U^T r_i = \sum_{j=1}^{n} \frac{u_j^T r_i}{\sigma_j}\, v_j,$$

so, for any k,

$$\|x - x_i\|_2 \ge \Bigg( \sum_{j=n+1-k}^{n} \Big(\frac{u_j^T r_i}{\sigma_j}\Big)^{2} \Bigg)^{1/2} \ge \frac{1}{\sigma_{n+1-k}} \Bigg( \sum_{j=n+1-k}^{n} (u_j^T r_i)^{2} \Bigg)^{1/2} = \frac{\|P_k r_i\|_2}{\sigma_{n+1-k}},$$

where P_k = U_k U_k^T with U_k = [u_{n+1-k}, ..., u_n]. Hence

$$\mu_i^{(2)} \le \frac{\sigma_{n+1-k}}{\sigma_1}\,\frac{\|r_i\|_2}{\|P_k r_i\|_2},$$

so μ_i^{(2)} ≪ 1 if r_i contains a significant component in span(U_k) for any k such that σ_{n+1−k} ≈ σ_n.
Expect μ_i^{(2)} ≪ 1 when r_i is "typical", i.e., contains sizeable components in the direction of each left singular vector. In that case x − x_i is not "typical": it contains large components in the right singular vectors corresponding to small singular values of A.
Wilkinson (1977), comment in an unpublished manuscript: μ_i^{(2)} increases with i.
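A small MATLAB illustration of the two regimes, using the slide's definition of μ_i with the error e = x − x_i (the matrix and error directions are my assumed constructions): a generic error gives μ_i = O(1), while an error concentrated in the right singular vector for the smallest singular value gives μ_i = 1/κ_2(A):

    n = 100;
    A = gallery('randsvd', n, 1e8, 2);           % assumed matrix, kappa_2(A) = 1e8
    [U, S, V] = svd(A);
    mu = @(e) norm(A*e) / (norm(A) * norm(e));   % mu_i for error e = x - x_i
    mu_generic = mu(randn(n, 1))                 % generic error direction: O(1)
    mu_small   = mu(V(:, n))                     % error along v_n: sigma_n/sigma_1 = 1e-8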


29 Key Analysis Innovations II. Allow for a general solver: let u_s be the effective precision of the solve, with u ≤ u_s ≤ u_f. (Running example: LU solve, for which u_s = u_f.)
Assume the computed solution d̂_i to Ad_i = r_i satisfies:
1. d̂_i = (I + u_s E_i) d_i with u_s ‖E_i‖_∞ < 1: the normwise relative forward error is bounded by a multiple of u_s and is less than 1. LU solve: u_s ‖E_i‖_∞ ≤ 3n u_f ‖ |A^{-1}||L̂||Û| ‖_∞.
2. ‖r_i − A d̂_i‖_∞ ≤ u_s (c_1 ‖A‖_∞ ‖d̂_i‖_∞ + c_2 ‖r_i‖_∞): the normwise relative backward error is at most max(c_1, c_2) u_s. LU solve: max(c_1, c_2) u_s ≤ 3n u_f ‖ |L̂||Û| ‖_∞ / ‖A‖_∞.
3. |r_i − A d̂_i| ≤ u_s G_i |d̂_i|: the componentwise relative backward error is bounded by a multiple of u_s. LU solve: u_s G_i ≤ 3n u_f |L̂||Û|.
Here E_i, c_1, c_2, and G_i depend on A, r_i, n, and u_s.
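A quick numerical check of assumption 2 for the LU solve, in MATLAB (my sketch; the data are arbitrary): the normwise relative backward error of a correction solve done with single-precision factors should be of order n*u_f:

    n = 50;
    A = gallery('randsvd', n, 1e4, 2); r = randn(n, 1);   % assumed data
    [L, U, p] = lu(single(A), 'vector');                  % factors in u_f = single
    d = double(U \ (L \ single(r(p))));                   % the computed correction
    bwd = norm(r - A*d, inf) / (norm(A, inf)*norm(d, inf) + norm(r, inf))
    n*eps('single')/2                                     % compare with n*u_f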


32 Forward Error for IR3. Three precisions: u_f = factorization precision, u = working precision, u_r = residual computation precision. Notation: κ_∞(A) = ‖A^{-1}‖_∞ ‖A‖_∞, cond(A) = ‖ |A^{-1}||A| ‖_∞, cond(A, x) = ‖ |A^{-1}||A||x| ‖_∞ / ‖x‖_∞.
Theorem [C. and Higham, SISC 40(2), 2018]. For IR in precisions u_f ≥ u ≥ u_r and effective solve precision u_s, if

$$\phi_i = 2 u_s \min\big(\mathrm{cond}(A),\, \kappa_\infty(A)\,\mu_i\big) + u_s \|E_i\|_\infty$$

is sufficiently less than 1, then the forward error is reduced on the i-th iteration by a factor of approximately φ_i until an iterate x_i is produced for which

$$\frac{\|x - x_i\|_\infty}{\|x\|_\infty} \lesssim 4 N u_r\, \mathrm{cond}(A, x) + u,$$

where N is the maximum number of nonzeros per row in A.
The analogous traditional bound is φ_i ≈ 3n u_f κ_∞(A).
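The three condition numbers in the theorem can be evaluated directly in MATLAB for small dense examples (a sketch with an assumed test problem; forming inv(A) is acceptable at this size):

    A = gallery('randsvd', 100, 1e9, 2); x = randn(100, 1);      % assumed data
    Ainv = inv(A);
    kappa  = norm(Ainv, inf) * norm(A, inf)                      % kappa_inf(A)
    condA  = norm(abs(Ainv)*abs(A), inf)                         % cond(A)
    condAx = norm(abs(Ainv)*abs(A)*abs(x), inf) / norm(x, inf)   % cond(A, x)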

33 Normwise Backward Error for IR3. Theorem [C. and Higham, SISC 40(2), 2018]. For IR in precisions u_f ≥ u ≥ u_r and effective solve precision u_s, if φ_i = (c_1 κ_∞(A) + c_2) u_s is sufficiently less than 1, then the residual is reduced on the i-th iteration by a factor of approximately φ_i until an iterate x_i is produced for which

$$\|b - A x_i\|_\infty \lesssim N u \,(\|b\|_\infty + \|A\|_\infty \|x_i\|_\infty),$$

where N is the maximum number of nonzeros per row in A.
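The quantity the theorem drives down is cheap to monitor; a sketch of computing the normwise backward error of an iterate (assumed data, with the iterate taken from backslash for illustration):

    A = gallery('randsvd', 100, 1e6, 2); b = randn(100, 1);   % assumed data
    x = A \ b;                                                % stand-in for an iterate x_i
    eta = norm(b - A*x, inf) / (norm(b, inf) + norm(A, inf)*norm(x, inf))
    % at convergence of IR3 the theorem bounds eta by roughly N*u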

34 IR3: Summary. Standard (LU-based) IR in three precisions (u_s = u_f). Unit roundoffs: Half ≈ 10^-4, Single ≈ 10^-8, Double ≈ 10^-16, Quad ≈ 10^-34.

    u_f  u  u_r   type       max κ_∞(A)   bwd err (norm)   bwd err (comp)   forward error
    H    S  S     LP fact.   10^4         10^-8            10^-8            cond(A,x)·10^-8
    H    S  D     New        10^4         10^-8            10^-8            10^-8
    H    D  D     LP fact.   10^4         10^-16           10^-16           cond(A,x)·10^-16
    H    D  Q     New        10^4         10^-16           10^-16           10^-16
    S    S  S     Fixed      10^8         10^-8            10^-8            cond(A,x)·10^-8
    S    S  D     Trad.      10^8         10^-8            10^-8            10^-8
    S    D  D     LP fact.   10^8         10^-16           10^-16           cond(A,x)·10^-16
    S    D  Q     New        10^8         10^-16           10^-16           10^-16

Benefit of IR3 vs. "LP fact.": no cond(A, x) term in the forward error.
Benefit of IR3 vs. traditional IR: as long as κ_∞(A) ≤ 10^4, one can use a lower precision factorization with no loss of accuracy!

41 A = gallery('randsvd', 100, 1e9, 2), b = randn(100,1); κ_∞(A) ≈ 2e10, cond(A, x) ≈ 5e9. Standard (LU-based) IR with u_f: single, u: double, u_r: double. [convergence plot]

42 A = gallery('randsvd', 100, 1e9, 2), b = randn(100,1); κ_∞(A) ≈ 2e10, cond(A, x) ≈ 5e9. Standard (LU-based) IR with u_f: single, u: double, u_r: quad. [convergence plot]

44 A = gallery('randsvd', 100, 1e9, 2), b = randn(100,1); κ_∞(A) ≈ 2e10, cond(A, x) ≈ 5e9. Standard (LU-based) IR with u_f: double, u: double, u_r: quad. [convergence plot]


48 GMRES-Based Iterative Refinement. Observation [Rump, 1990]: if L̂ and Û are the LU factors of A computed in precision u_f, then

$$\kappa_\infty(\hat{U}^{-1}\hat{L}^{-1}A) \approx 1 + \kappa_\infty(A)\, u_f,$$

even if κ_∞(A) u_f ≫ 1.
GMRES-IR [C. and Higham, SISC 39(6), 2017]: to compute the updates d_i, apply GMRES to the preconditioned system Ã d_i = r̃_i, where Ã = Û^{-1}L̂^{-1}A and r̃_i = Û^{-1}L̂^{-1} r_i:
Solve Ax_0 = b by LU factorization
for i = 0 : maxit
    r_i = b - Ax_i
    Solve Ad_i = r_i via GMRES on Ã d_i = r̃_i   (u_s = u)
    x_{i+1} = x_i + d_i
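A MATLAB sketch of GMRES-IR along these lines (my construction; MATLAB's gmres applies the preconditioner M1*M2 from the left, here the LU factors computed in single; the tolerance and iteration caps are assumptions, and the residual is computed in double since MATLAB has no native quad):

    A = gallery('randsvd', 100, 1e9, 2); b = randn(100, 1);      % assumed data
    [Ls, Us, P] = lu(single(A));                % LU in precision u_f = single
    L = double(Ls); U = double(Us); P = double(P);
    x = U \ (L \ (P*b));                        % initial solve
    for i = 1:10
        r = b - A*x;                            % residual (u_r = u here)
        d = gmres(@(v) P*(A*v), P*r, [], 1e-10, 100, L, U);   % GMRES on Atilde*d = rtilde
        x = x + d;                              % update in working precision
    end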

49 A = gallery('randsvd', 100, 1e9, 2), b = randn(100,1); κ_∞(A) ≈ 2e10, cond(A, x) ≈ 5e9, κ_∞(Ã) ≈ 2e4. GMRES-IR with u_f: single, u: double, u_r: quad. [convergence plot]
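Rump's observation is easy to confirm numerically for this example (a sketch; forming Ã densely is fine at n = 100):

    A = gallery('randsvd', 100, 1e9, 2);
    [Ls, Us, P] = lu(single(A));
    Atilde = double(Us) \ (double(Ls) \ (double(P)*A));   % U^{-1} L^{-1} P A
    cond(Atilde, inf)      % orders of magnitude below kappa_inf(A) ~ 2e10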

50 GMRES-IR: Summary. Benefits of GMRES-IR:

    solver     u_f  u  u_r   max κ_∞(A)   bwd err (norm)   bwd err (comp)   forward error
    LU-IR      H    S  D     10^4         10^-8            10^-8            10^-8
    GMRES-IR   H    S  D     10^8         10^-8            10^-8            10^-8
    LU-IR      S    D  Q     10^8         10^-16           10^-16           10^-16
    GMRES-IR   S    D  Q     10^16        10^-16           10^-16           10^-16
    LU-IR      H    D  Q     10^4         10^-16           10^-16           10^-16
    GMRES-IR   H    D  Q     10^12        10^-16           10^-16           10^-16

With GMRES-IR, a lower precision factorization will work for higher κ_∞(A): the convergence condition relaxes to κ_∞(A) ≤ u^{-1/2} u_f^{-1}. So if κ_∞(A) ≤ 10^12 (the half/double/quad row), one can use a lower precision factorization with no loss of accuracy!
Try IR3! MATLAB codes available at:
Performance results? Stay tuned for the next talk by Azzam Haidar: a 3-precision approach on NVIDIA V100.


60 Comments and Caveats.
Convergence tolerance τ for GMRES? Smaller τ: more GMRES iterations, potentially fewer refinement steps. Larger τ: fewer GMRES iterations, potentially more refinement steps.
Convergence rate of GMRES? If A is ill conditioned and the LU factorization is performed in very low precision, it can be a poor preconditioner: e.g., if Ã still has a cluster of eigenvalues near the origin, GMRES can stagnate until the n-th iteration, regardless of κ_∞(Ã) [Liesen and Tichý, 2004]. Potential remedies: deflation, Krylov subspace recycling, using an additional preconditioner.
Depending on the conditioning of A, applying Ã to a vector must be done accurately (in precision u^2) in each GMRES iteration.
Why GMRES? Theoretical purposes: existing analysis and proof of backward stability [Paige, Rozložník, Strakoš, 2006]. In practice, use any solver you want!

61 Thank You! math.nyu.edu/~erinc
