Error Bounds for Iterative Refinement in Three Precisions


Error Bounds for Iterative Refinement in Three Precisions
Erin C. Carson, New York University
Nicholas J. Higham, University of Manchester
SIAM Annual Meeting, Portland, Oregon, July 13, 2018

Hardware Support for Multiprecision Computation

Use of low precision in machine learning has driven the emergence of low-precision capabilities in hardware:
- Half precision (FP16) defined as a storage format in the 2008 IEEE standard
- ARM NEON: SIMD architecture, instructions for 8x16-bit, 4x32-bit, 2x64-bit operations
- AMD Radeon Instinct MI25 GPU, 2017: single: 12.3 TFLOPS, half: 24.6 TFLOPS
- NVIDIA Tesla P100, 2016: native ISA support for 16-bit FP arithmetic
- NVIDIA Tesla V100, 2017: tensor cores for half precision; 4x4 matrix multiply in one clock cycle; double: 7 TFLOPS, half+tensor: 112 TFLOPS (16x!)
- Google's Tensor Processing Unit (TPU): quantizes 32-bit FP computations into 8-bit integer arithmetic
- Aurora exascale supercomputer (2021): expected extensive support for reduced-precision arithmetic (32/16/8-bit)

Iterative Refinement for Ax = b

Iterative refinement is a well-established method for improving an approximate solution to Ax = b, where A is n x n and nonsingular and u is the unit roundoff:

    Solve Ax_0 = b by LU factorization
    for i = 0 : maxit
        r_i = b - A x_i
        Solve A d_i = r_i via d_i = U^{-1}(L^{-1} r_i)
        x_{i+1} = x_i + d_i

The classical variants differ only in the precisions used for each step:

- "Traditional" (high-precision residual computation): factorization, solve, and update in precision u; residual computed in precision u^2. [Wilkinson, 1948] (fixed point), [Moler, 1967] (floating point)
- "Fixed-precision": every step, including the residual, in precision u. [Jankowski and Woźniakowski, 1977], [Skeel, 1980], [Higham, 1991]
- "Low-precision factorization": factorization in precision u^{1/2}; residual, solve, and update in precision u. [Langou et al., 2006], [Arioli and Duff, 2009], [Hogg and Scott, 2010], [Abdelfattah et al., 2016]
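To make this concrete, here is a minimal MATLAB sketch of the low-precision-factorization variant, with single as the factorization precision and double as the working and residual precision; the test matrix, iteration cap, and stopping test are illustrative choices, not from the talk.

    % Low-precision-factorization IR: factorize in single (u_f),
    % iterate in double (u = u_r = u_f^2, roughly).
    n = 100;
    A = randn(n); b = randn(n, 1);
    [L, U, p] = lu(single(A), 'vector');     % factorization in low precision u_f
    x = double(U) \ (double(L) \ b(p));      % initial solve using the LU factors
    for i = 1:10
        r = b - A*x;                         % residual in working precision u
        d = double(U) \ (double(L) \ r(p));  % correction via the low-precision factors
        x = x + d;                           % update in working precision u
        if norm(b - A*x, inf) <= 1e-14*(norm(b, inf) + norm(A, inf)*norm(x, inf))
            break                            % backward error has reached ~u level
        end
    end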

Iterative Refinement in 3 Precisions

Existing analyses support at most two precisions. Can we combine the performance benefits of low-precision-factorization IR with the accuracy of traditional IR?

3-precision iterative refinement [C. and Higham, SIAM SISC 40(2), 2018]: u_f = factorization precision, u = working precision, u_r = residual precision, with u_f >= u >= u_r.

The new analysis generalizes the existing types of IR:
- Traditional: u_f = u, u_r = u^2
- Fixed precision: u_f = u = u_r
- Low-precision factorization: u_f^2 = u = u_r
(and improves upon the existing analyses in some cases)

It also enables new types of IR: (half, single, double), (half, single, quad), (half, double, quad), etc.

Key Analysis Innovations I

Obtain tighter upper bounds. The typical bound used in analyses is

$\|A(x - x_i)\|_\infty \le \|A\|_\infty \|x - x_i\|_\infty$.

Instead, define $\mu_i$ by

$\|A(x - x_i)\|_\infty = \mu_i \|A\|_\infty \|x - x_i\|_\infty$.

For a stable refinement scheme, in the early stages we expect

$\|r_i\| \approx u \|A\| \|x_i\| \ll \|A\| \|x - x_i\|$, so $\mu_i \ll 1$,

but close to convergence,

$\|r_i\| \approx \|A\| \|x - x_i\|$, so $\mu_i \approx 1$.

Key Analysis Innovations I (continued)

In the 2-norm, write $\|r_i\|_2 = \mu_i^{(2)} \|A\|_2 \|x - x_i\|_2$. With the SVD $A = U \Sigma V^T$,

$x - x_i = V \Sigma^{-1} U^T r_i = \sum_{j=1}^{n} \frac{u_j^T r_i}{\sigma_j} v_j$,

and keeping only the k smallest singular values,

$\|x - x_i\|_2 \ge \left( \sum_{j=n+1-k}^{n} \frac{(u_j^T r_i)^2}{\sigma_j^2} \right)^{1/2} \ge \frac{1}{\sigma_{n+1-k}} \left( \sum_{j=n+1-k}^{n} (u_j^T r_i)^2 \right)^{1/2} = \frac{\|P_k r_i\|_2}{\sigma_{n+1-k}}$,

where $P_k = U_k U_k^T$ and $U_k = [u_{n+1-k}, \dots, u_n]$. Hence

$\mu_i^{(2)} \le \frac{\|r_i\|_2}{\|P_k r_i\|_2} \cdot \frac{\sigma_{n+1-k}}{\sigma_1}$,

so $\mu_i^{(2)} \ll 1$ if $r_i$ contains a significant component in span$(U_k)$ for any k such that $\sigma_{n+1-k} \approx \sigma_n$.

We expect $\mu_i \ll 1$ when $r_i$ is "typical", i.e., contains sizeable components in the direction of each left singular vector. In that case $x - x_i$ is not "typical": it contains large components in the right singular vectors corresponding to small singular values of A.

Wilkinson (1977), in a comment in an unpublished manuscript: $\mu_i^{(2)}$ increases with i.
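Both regimes are easy to observe numerically. A minimal MATLAB sketch with an illustrative test matrix (not from the talk); since x is known by construction, $\mu_i^{(2)}$ can be computed directly:

    % Track mu_i^(2) = ||r_i||_2 / (||A||_2 ||x - x_i||_2) across refinement
    % steps: typically << 1 early on, growing toward 1 as x_i converges.
    n = 50;
    A = gallery('randsvd', n, 1e8);     % ill-conditioned random test matrix
    x = randn(n, 1); b = A*x;           % exact solution known by construction
    [L, U, p] = lu(single(A), 'vector');
    xi = double(U) \ (double(L) \ b(p));
    for i = 1:5
        r = b - A*xi;
        mu = norm(r) / (norm(A) * norm(x - xi));
        fprintf('iteration %d: mu = %.2e\n', i, mu);
        d = double(U) \ (double(L) \ r(p));
        xi = xi + d;
    end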

Key Analysis Innovations II

Allow for a general solver. Let $u_s$ be the effective precision of the solve, with $u \le u_s \le u_f$. (Example: for the LU solve, $u_s = u_f$.)

Assume the computed solution $\hat{d}_i$ to $A d_i = r_i$ satisfies:

1. $\hat{d}_i = (I + u_s E_i) d_i$ with $u_s \|E_i\|_\infty < 1$: the normwise relative forward error is bounded by a multiple of $u_s$ and is less than 1. For the LU solve, $u_s \|E_i\|_\infty \lesssim 3n u_f \, \| |A^{-1}| |\hat{L}| |\hat{U}| \|_\infty$.

2. $\|r_i - A \hat{d}_i\|_\infty \le u_s (c_1 \|A\|_\infty \|\hat{d}_i\|_\infty + c_2 \|r_i\|_\infty)$: the normwise relative backward error is at most $\max(c_1, c_2)\, u_s$. For the LU solve, $\max(c_1, c_2)\, u_s \lesssim 3n u_f \, \| |\hat{L}| |\hat{U}| \|_\infty / \|A\|_\infty$.

3. $|r_i - A \hat{d}_i| \le u_s G_i |\hat{d}_i|$: the componentwise relative backward error is bounded by a multiple of $u_s$. For the LU solve, $u_s G_i \lesssim 3n u_f |\hat{L}| |\hat{U}|$.

$E_i$, $c_1$, $c_2$, and $G_i$ depend on A, $r_i$, n, and $u_s$. (A numerical check of assumption 2 follows below.)
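For the LU solve these assumptions can be checked directly. A minimal MATLAB sketch of assumption 2, with an illustrative random system (not from the talk):

    % With u_s = u_f (single-precision LU), the normwise relative backward
    % error of the computed correction should be a modest multiple of u_f.
    n = 100;
    A = randn(n); r = randn(n, 1);
    [L, U, p] = lu(single(A), 'vector');
    d = double(U) \ (double(L) \ r(p));   % computed solution of A d = r
    bwerr = norm(r - A*d, inf) / (norm(A, inf)*norm(d, inf) + norm(r, inf));
    fprintf('backward error = %.2e, u_f = %.2e\n', bwerr, eps('single'));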

Forward Error for IR3

Three precisions: $u_f$ = factorization precision, $u$ = working precision, $u_r$ = residual computation precision. Define

$\kappa_\infty(A) = \|A^{-1}\|_\infty \|A\|_\infty$, $\quad \mathrm{cond}(A) = \| |A^{-1}| |A| \|_\infty$, $\quad \mathrm{cond}(A, x) = \| |A^{-1}| |A| |x| \|_\infty / \|x\|_\infty$.

Theorem [C. and Higham, SISC 40(2), 2018]. For IR in precisions $u_f \ge u \ge u_r$ and effective solve precision $u_s$, if

$\phi_i \equiv 2 u_s \min(\mathrm{cond}(A), \kappa_\infty(A) \mu_i) + u_s \|E_i\|_\infty$

is sufficiently less than 1, then the forward error is reduced on the i-th iteration by a factor approximately $\phi_i$ until an iterate $x_i$ is produced for which

$\frac{\|x - x_i\|_\infty}{\|x\|_\infty} \lesssim 4 N u_r \, \mathrm{cond}(A, x) + u$,

where N is the maximum number of nonzeros per row in A.

The analogous traditional bound is $\phi_i \approx 3 n u_f \kappa_\infty(A)$.
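As a worked instance of how the summary table below is produced, take (u_f, u, u_r) = (half, single, double) with the LU solve ($u_s = u_f$) and a dense matrix (N = n); dropping the dimension-dependent constants,

$\phi_i \approx u_f \, \kappa_\infty(A) \ll 1 \iff \kappa_\infty(A) \ll u_f^{-1} \approx 10^4, \qquad \frac{\|x - x_i\|_\infty}{\|x\|_\infty} \lesssim u_r \, \mathrm{cond}(A, x) + u \approx u \approx 10^{-8}$,

since $u_r \, \mathrm{cond}(A, x) = 10^{-16} \, \mathrm{cond}(A, x) \le u$ whenever $\mathrm{cond}(A, x) \le 10^8$.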

Normwise Backward Error for IR3

Theorem [C. and Higham, SISC 40(2), 2018]. For IR in precisions $u_f \ge u \ge u_r$ and effective solve precision $u_s$, if

$\phi_i \equiv (c_1 \kappa_\infty(A) + c_2) u_s$

is sufficiently less than 1, then the residual is reduced on the i-th iteration by a factor approximately $\phi_i$ until an iterate $x_i$ is produced for which

$\|b - A x_i\|_\infty \lesssim N u \, (\|b\|_\infty + \|A\|_\infty \|x_i\|_\infty)$,

where N is the maximum number of nonzeros per row in A.

IR3: Summary

Standard (LU-based) IR in three precisions (u_s = u_f). Unit roundoffs: half ~ 10^-4, single ~ 10^-8, double ~ 10^-16, quad ~ 10^-34.

              u_f   u   u_r   max κ∞(A)   backward error (norm / comp)   forward error
  LP fact.    H     S   S     10^4        10^-8  / 10^-8                 cond(A,x) * 10^-8
  New         H     S   D     10^4        10^-8  / 10^-8                 10^-8
  LP fact.    H     D   D     10^4        10^-16 / 10^-16                cond(A,x) * 10^-16
  New         H     D   Q     10^4        10^-16 / 10^-16                10^-16
  Fixed       S     S   S     10^8        10^-8  / 10^-8                 cond(A,x) * 10^-8
  Trad.       S     S   D     10^8        10^-8  / 10^-8                 10^-8
  LP fact.    S     D   D     10^8        10^-16 / 10^-16                cond(A,x) * 10^-16
  New         S     D   Q     10^8        10^-16 / 10^-16                10^-16

Benefit of IR3 vs. "LP fact.": no cond(A,x) term in the forward error.
Benefit of IR3 vs. traditional IR: as long as κ∞(A) <= 10^4, a lower-precision factorization can be used with no loss of accuracy!

Numerical experiments: A = gallery('randsvd', 100, 1e9, 2), b = randn(100,1), with κ∞(A) ≈ 2e10 and cond(A,x) ≈ 5e9.

[Convergence plots, not reproduced here: standard (LU-based) IR with (u_f, u, u_r) = (single, double, double), (single, double, quad), and (double, double, quad).]
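The test problem is easy to reproduce; a minimal MATLAB sketch (the seed is an illustrative choice, and mode 2 of randsvd gives one small singular value, so the exact condition numbers are seed-dependent):

    % Test problem from the experiments: randsvd matrix with 2-norm
    % condition number 1e9 (mode 2) and a random right-hand side.
    rng(1);                                  % illustrative seed
    A = gallery('randsvd', 100, 1e9, 2);
    b = randn(100, 1);
    x = A \ b;                               % reference solution
    kappa_inf = norm(inv(A), inf) * norm(A, inf)                        % talk reports ~2e10
    cond_A_x  = norm(abs(inv(A)) * abs(A) * abs(x), inf) / norm(x, inf) % talk reports ~5e9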

GMRES-Based Iterative Refinement

Observation [Rump, 1990]: if $\hat{L}$ and $\hat{U}$ are the computed LU factors of A in precision $u_f$, then

$\kappa(\hat{U}^{-1} \hat{L}^{-1} A) \approx 1 + \kappa(A) u_f$,

even if $\kappa(A) u_f \gtrsim 1$.

GMRES-IR [C. and Higham, SISC 39(6), 2017]: to compute the updates $d_i$, apply GMRES to the preconditioned system $\tilde{A} d_i = \tilde{r}_i$, where $\tilde{A} = \hat{U}^{-1} \hat{L}^{-1} A$ and $\tilde{r}_i = \hat{U}^{-1} \hat{L}^{-1} r_i$:

    Solve Ax_0 = b by LU factorization
    for i = 0 : maxit
        r_i = b - A x_i
        Solve A d_i = r_i via GMRES on Ã d_i = r̃_i    (u_s = u)
        x_{i+1} = x_i + d_i
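A minimal MATLAB sketch of this loop, using the built-in gmres with the low-precision LU factors as a left preconditioner; the tolerance and iteration caps are illustrative choices (the codes from the talk are at https://github.com/eccarson/ir3):

    % GMRES-IR sketch: a single-precision LU factorization preconditions
    % GMRES, which computes each correction d_i in double precision.
    n = 100;
    A = gallery('randsvd', n, 1e9, 2); b = randn(n, 1);
    [L, U, P] = lu(single(A));               % low-precision factorization, P*A = L*U
    L = double(L); U = double(U); P = double(P);
    x = U \ (L \ (P*b));                     % initial solve
    for i = 1:10
        r = b - A*x;                         % residual (quad in the talk; double here)
        % Left preconditioner M = M1*M2 = (P'*L)*U ~ A, so GMRES effectively
        % iterates on U^{-1} L^{-1} P A d = U^{-1} L^{-1} P r.
        [d, flag] = gmres(A, r, [], 1e-8, n, P'*L, U);
        x = x + d;                           % update in working precision
    end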

A = gallery('randsvd', 100, 1e9, 2), b = randn(100,1), with κ∞(A) ≈ 2e10, cond(A,x) ≈ 5e9, and κ∞(Ã) ≈ 2e4.

[Convergence plot, not reproduced here: GMRES-IR with u_f: single, u: double, u_r: quad.]

GMRES-IR: Summary

Benefits of GMRES-IR:

              u_f   u   u_r   max κ∞(A)   backward error (norm / comp)   forward error
  LU-IR       H     S   D     10^4        10^-8  / 10^-8                 10^-8
  GMRES-IR    H     S   D     10^8        10^-8  / 10^-8                 10^-8
  LU-IR       S     D   Q     10^8        10^-16 / 10^-16                10^-16
  GMRES-IR    S     D   Q     10^16       10^-16 / 10^-16                10^-16
  LU-IR       H     D   Q     10^4        10^-16 / 10^-16                10^-16
  GMRES-IR    H     D   Q     10^12       10^-16 / 10^-16                10^-16

With GMRES-IR, the lower-precision factorization works for larger κ∞(A): the constraint becomes κ∞(A) <= u^{-1/2} u_f^{-1}. So if κ∞(A) <= 10^12 (for half/double/quad), a lower-precision factorization can be used with no loss of accuracy!

Try IR3! MATLAB codes available at: https://github.com/eccarson/ir3

Performance results? Stay tuned for the next talk by Azzam Haidar on the 3-precision approach on an NVIDIA V100.

Comments and Caveats

Convergence tolerance τ for GMRES?
- Smaller τ: more GMRES iterations, but potentially fewer refinement steps.
- Larger τ: fewer GMRES iterations, but potentially more refinement steps.

Convergence rate of GMRES?
- If A is ill conditioned and the LU factorization is performed in very low precision, the factors can make a poor preconditioner: e.g., if Ã still has a cluster of eigenvalues near the origin, GMRES can stagnate until the n-th iteration, regardless of κ∞(Ã) [Liesen and Tichý, 2004].
- Potential remedies: deflation, Krylov subspace recycling, or an additional preconditioner.

Depending on the conditioning of A, applying Ã to a vector must be done accurately (in precision u^2) in each GMRES iteration.

Why GMRES? For theoretical purposes: an existing analysis and proof of backward stability [Paige, Rozložník, Strakoš, 2006]. In practice, use any solver you want!

Thank You!

erinc@cims.nyu.edu
math.nyu.edu/~erinc