Towards Fast, Accurate and Reproducible LU Factorization
1 Towards Fast, Accurate and Reproducible LU Factorization
Roman Iakymchuk (1), David Defour (2), and Stef Graillat (3)
(1) KTH Royal Institute of Technology, CSC, CST/PDC
(2) Université de Perpignan, DALI, LIRMM
(3) Sorbonne Universités, UPMC Univ Paris VI, UMR 7606, LIP6
riakymch@kth.se
SCAN 2016, Sept 26-29, 2016, Uppsala, Sweden
Roman Iakymchuk (KTH), Towards Reproducible LU, Sept 26-29, 27 slides
2-3 Linear Algebra Libraries
BLAS-1 [1979]: y := y + αx and α := α + x^T y, with α ∈ R and x, y ∈ R^n; flop-to-memory-access ratio about 2/3
BLAS-2 [1988]: A := A + x y^T and y := A^{-1} x, with A ∈ R^{n×n} and x, y ∈ R^n; ratio about 2
BLAS-3 [1990]: C := C + A B and C := A^{-1} B, with A, B, C ∈ R^{n×n}; ratio about n/2
The Basic Linear Algebra Subprograms (BLAS) underpin higher-level libraries such as LAPACK, FLAME, and NAG; implementations include the reference BLAS, MKL, cuBLAS, OpenBLAS, and ATLAS.
4-5 Goals
Compute BLAS operations on floating-point numbers fast and accurately, ensuring their numerical reproducibility on a wide range of architectures.
ExBLAS (Exact BLAS):
ExBLAS-1: ExSUM, ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
Use the ExBLAS kernels to construct exact higher-level operations such as the LU factorization.
6 Outline
1. Accuracy and Reproducibility of FP Operations
2. Exact Parallel Reduction
3. ExBLAS and Reproducible LU
4. Performance Results
5. Conclusions and Future Work
8-10 Accuracy and Reproducibility Problems
Floating-point arithmetic suffers from rounding errors.
Floating-point operations (+, ×) are commutative but non-associative: (-1 + 1) + 2^-53 ≠ -1 + (1 + 2^-53) in double precision.
Consequence: the result of a floating-point computation depends on the order of the operations.
Results computed by performance-optimized parallel floating-point libraries are often inconsistent: each run may return a different result.
Reproducibility: the ability to obtain bit-wise identical results from run to run on the same input data, on the same or on different architectures.
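The non-associativity example is easy to check in Python, whose floats are IEEE binary64; a minimal sketch:

```python
# (-1 + 1) + 2^-53: the cancellation happens first, so the tiny term survives
left = (-1.0 + 1.0) + 2**-53

# -1 + (1 + 2^-53): 1.0 + 2^-53 is a round-to-even tie and rounds back to 1.0,
# so the tiny term is lost before the cancellation
right = -1.0 + (1.0 + 2**-53)
```

Both expressions sum the same three numbers, yet one yields 2^-53 and the other 0.0, which is exactly why the order of a parallel reduction changes its result.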
11-13 Existing Solutions
Fix the order of computations:
- Sequential mode: intolerably costly on large-scale systems
- Fixed reduction trees: substantial communication overhead
- Example: Intel Conditional Numerical Reproducibility (roughly 2x slowdown, no accuracy guarantees)
Eliminate/reduce the rounding errors:
- Fixed-point arithmetic: limited range of values
- Fixed-size FP expansions with error-free transformations (EFT); example: double-double or quad-double (Briggs, Bailey, Hida, Li), which work well on sets of relatively close numbers
- Infinite precision: reproducible independently of the inputs; example: the Kulisch accumulator (often considered inefficient)
Libraries:
- ReproBLAS, Reproducible BLAS (Demmel, Nguyen, Ahrens): BLAS-1, GEMV, and GEMM on CPUs
- RARE-BLAS, Reproducible, Accurately Rounded and Efficient BLAS (Chohra, Langlois, Parello): BLAS-1 and GEMV on CPUs
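To give a concrete taste of the EFT idea behind double-double and FP expansions, here is classic Kahan compensated summation in Python. It is not one of the libraries above, just the simplest scheme that captures the rounding error of each addition:

```python
def kahan_sum(xs):
    """Compensated summation: c carries the rounding error of each
    addition, recovered with an error-free transformation."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c          # re-inject the previously lost low-order part
        t = s + y
        c = (t - s) - y    # the part of y that s + y just rounded away
        s = t
    return s

# the two 1.0 terms are absorbed by 1e16 in a naive sum, but not here
xs = [1e16, 1.0, 1.0, -1e16]
```

Compensation reduces the error but, unlike a superaccumulator, does not by itself guarantee one bit-wise identical answer for every summation order.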
15 Our Multi-Level Reproducible Summation
A parallel algorithm with five levels, suited to today's parallel architectures.
Based on floating-point expansions (FPE) with error-free transformations and a Kulisch superaccumulator.
Guarantees infinite-precision accumulation and therefore bit-wise reproducibility.
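Python's `math.fsum` computes the correctly rounded sum (via Shewchuk's expansion algorithm), so it can illustrate the guarantee this algorithm provides: one well-defined answer regardless of ordering. A minimal sketch:

```python
import math

# 1e16 has an ulp of 2, so a naive left-to-right sum absorbs the 1.0 terms
# or not, depending on where they appear in the input
a = [1e16, 1.0, 1.0]
b = [1.0, 1.0, 1e16]
naive_a = a[0] + a[1] + a[2]   # both 1.0 terms are lost
naive_b = b[0] + b[1] + b[2]   # 1.0 + 1.0 = 2.0 survives

# a correctly rounded sum is bit-wise identical for every ordering
exact_a = math.fsum(a)
exact_b = math.fsum(b)
```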
16 Level 1: Filtering
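The filtering level accumulates each input into a small floating-point expansion using the TwoSum error-free transformation, flushing to the superaccumulator only what does not fit. A compact Python sketch of this idea (the actual ExBLAS implementation is vectorized and per-thread):

```python
def two_sum(a, b):
    """Knuth's error-free transformation: s + e == a + b exactly."""
    s = a + b
    a_virtual = s - b
    b_virtual = s - a_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

def fpe_accumulate(fpe, x):
    """Add x into a fixed-size floating-point expansion in place; the
    returned residue is what the filtering level would flush onward."""
    for i in range(len(fpe)):
        fpe[i], x = two_sum(fpe[i], x)
    return x

fpe = [0.0, 0.0, 0.0]        # an FPE of size 3 (cf. FPE3 in the plots)
r1 = fpe_accumulate(fpe, 1.0)
r2 = fpe_accumulate(fpe, 2**-60)
```

After the two calls the expansion holds 1.0 and 2^-60 in separate slots, so no information has been rounded away.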
17 Levels 2 and 3: Scalar Superaccumulator
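A Kulisch superaccumulator is a fixed-point register wide enough to hold any double exactly. Python's arbitrary-precision integers can emulate it in a few lines; this is a sketch of the principle only, not the banked binary layout ExBLAS uses:

```python
from fractions import Fraction

SHIFT = 1074  # 2^-1074 is the smallest positive binary64 value

def to_acc(x):
    """Exact fixed-point image of the double x as a Python big integer."""
    v = Fraction(x) * 2**SHIFT   # an integer for every finite double
    return v.numerator

def from_acc(acc):
    """Round the accumulated exact value to the nearest double."""
    return float(Fraction(acc, 2**SHIFT))

# a sum with a huge dynamic range: naive summation loses the 1.0 entirely
xs = [1e300, 1.0, -1e300, 2**-60]
exact = from_acc(sum(to_acc(x) for x in xs))
```

Because the accumulation is exact, the final rounding gives the same bits for every permutation of the inputs.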
18 Levels 4 and 5: Reduction and Rounding
20-25 ExBLAS Highlights
ExBLAS status:
ExBLAS-1: ExSUM, ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
ExSCAL: x := α·x is correctly rounded and reproducible.
Within LU, however, x := (1/α)·x is not correctly rounded: it incurs two roundings, one for the reciprocal and one for the product.
ExInvSCAL: x := x/α is correctly rounded and reproducible.
ExGER, general case: A := α·x·y^T + A.
Within LU, A := x·y^T + A; using FMA, it is correctly rounded and reproducible.
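The ExSCAL vs. ExInvSCAL distinction is easy to observe: multiplying by a rounded reciprocal performs two roundings, while dividing performs one, and the two results differ for many operand pairs. A quick Python check (an illustration only; ExBLAS operates on whole vectors):

```python
# pairs (x, alpha) where (1/alpha)*x and x/alpha round differently
mismatches = [
    (x, a)
    for x in range(1, 100)
    for a in range(1, 100)
    if (1.0 / a) * x != x / a   # two roundings vs. one rounding
]
```

The list is non-empty, which is why the reproducible LU divides by the pivot instead of multiplying by its reciprocal.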
26-28 LU Factorization
Solve Ax = b via the factorization A = LU of an M×N matrix A into a lower triangular factor L and an upper triangular factor U.
[Figures: the blocked algorithm sweeps panels of width q across A, combining a small LU on the current panel, a TRSM to update the block row, and a GEMM update of the trailing submatrix; recursing on the trailing submatrix yields a sequence of such LU/TRSM/GEMM steps.]
29 An Unblocked LU Factorization
Partition A as
[ A00   a01  A02  ]
[ a10^T α11  a12^T ]
[ A20   a21  A22  ]
with A00 the already factored leading square block. The loop body of this variant at step i is:
a10^T := a10^T U00^{-1}     (TRSV)
α11  := α11 - a10^T a01     (DOT)
a12^T := a12^T - a10^T A02  (GEMV)
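The three updates can be prototyped directly in NumPy. This is a plain double-precision sketch of the unblocked variant, without pivoting and without the ExBLAS kernels that make the talk's version reproducible:

```python
import numpy as np

def unblocked_lu(A):
    """Unblocked LU without pivoting: at step i the i-th row is
    updated with a TRSV, a DOT, and a GEMV, as on the slide."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(1, n):
        U00 = np.triu(A[:i, :i])              # the U factor built so far
        # a10^T := a10^T U00^{-1}  (trsv): row of L left of the diagonal
        A[i, :i] = np.linalg.solve(U00.T, A[i, :i])
        # alpha11 := alpha11 - a10^T a01  (dot): new diagonal entry of U
        A[i, i] -= A[i, :i] @ A[:i, i]
        # a12^T := a12^T - a10^T A02  (gemv): new row of U to the right
        A[i, i+1:] -= A[i, :i] @ A[:i, i+1:]
    return A

A = np.array([[4., 3., 2.], [6., 3., 1.], [8., 5., 9.]])
F = unblocked_lu(A)
L = np.tril(F, -1) + np.eye(3)   # unit lower triangular factor
U = np.triu(F)                   # upper triangular factor
```

The factored matrix overwrites A in the usual packed form: L below the unit diagonal, U on and above it.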
31 Parallel Reduction Performance Scaling on Intel Xeon Phi
[Figure: throughput (Gacc/s) versus array size (10^6 to 10^9) for: parallel FP sum, Demmel fast, TBB deterministic, Superacc, FPE2 + Superacc, FPE3 + Superacc, FPE4 + Superacc, FPE8EE + Superacc.]
32 Parallel Reduction Data-Dependent Performance on NVIDIA Tesla K20c
[Figure: throughput (Gacc/s) versus dynamic range (10^20 to 10^140) of the inputs, n = 67·10^6, for: parallel FP sum, Demmel fast, Superacc, FPE2/FPE3/FPE4/FPE8/FPE8EE + Superacc.]
33 Dot Product Performance Scaling on NVIDIA Tesla K20c
DDOT: α := x^T y = Σ_{i=1}^{N} x_i y_i
Based on TwoProduct and the reproducible summation:
TwoProduct(a, b):
  1: r ← a × b
  2: s ← fma(a, b, -r)
where fma(a, b, c) computes a × b + c with a single rounding, so s is the exact rounding error of the product.
[Figure: time (secs) versus array size (10^6 to 10^9) for: parallel DDOT, Superacc, FPE3/FPE4/FPE8/FPE8EE + Superacc.]
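TwoProduct as written relies on the hardware FMA. Python has no portable fma, but Dekker's splitting yields the same error-free transformation, with r + e equal to a·b exactly (barring over/underflow); a sketch:

```python
from fractions import Fraction

SPLIT_FACTOR = 2.0**27 + 1.0  # Dekker's splitter for binary64

def split(a):
    """Split a into a_hi + a_lo, each fitting in half a significand."""
    c = SPLIT_FACTOR * a
    a_hi = c - (c - a)
    return a_hi, a - a_hi

def two_product(a, b):
    """Error-free transformation of a product: returns (r, e) with
    r = fl(a*b) and r + e = a*b exactly (barring over/underflow)."""
    r = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = a_lo * b_lo - (((r - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return r, e

r, e = two_product(0.1, 0.3)
```

The exactness can be verified with rational arithmetic: Fraction(r) + Fraction(e) equals the true product of the two doubles.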
34 Matrix-Vector Product Performance Scaling on NVIDIA Tesla K80
DGEMV: y := αAx + βy
[Figure: the m×n matrix A is partitioned into blocks of mb rows; each block is multiplied with x and accumulated into the corresponding part of y.]
35 [Figure: time (secs) versus square matrix size m = n for: parallel DGEMV, Superacc, FPE3/FPE4/FPE8/FPE8EE + Superacc.]
36 [Figure: time (secs) versus n for fixed m = 256, same variants.]
37 Triangular Solver Matrix Partitioning
[Figure: a lower triangular matrix L is partitioned into bs×bs diagonal blocks, each solved by a TRSV assigned to a work-group (wg0 to wg3), while the off-diagonal blocks are applied as GEMV updates.]
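In plain NumPy the same partitioning reads as follows: each diagonal block is a small TRSV, and the block just solved updates the remaining right-hand side with a GEMV. This is a sequential sketch of the data flow, not the GPU work-group scheme:

```python
import numpy as np

def blocked_trsv_lower(L, b, bs):
    """Blocked forward substitution for L x = b (L lower triangular):
    a TRSV per bs-by-bs diagonal block, then a GEMV update of the
    still-unsolved part of the right-hand side."""
    x = b.astype(float).copy()
    n = L.shape[0]
    for k in range(0, n, bs):
        e = min(k + bs, n)
        x[k:e] = np.linalg.solve(L[k:e, k:e], x[k:e])  # TRSV on the block
        if e < n:
            x[e:] -= L[e:, k:e] @ x[k:e]               # GEMV update
    return x

L = np.array([[2., 0, 0, 0], [1, 3, 0, 0], [4, 1, 5, 0], [2, 2, 2, 2]])
b = np.array([2., 5., 14., 16.])
x = blocked_trsv_lower(L, b, bs=2)
```

In ExTRSV the inner solves and updates are replaced by ExDOT and ExGEMV, which is what makes the whole solver reproducible.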
38 Triangular Solver Performance Scaling on NVIDIA Tesla K20c
DTRSV: Ax = b
Blocked ExTRSV: based on ExDOT, with an internal ExGEMV.
[Figure: time (secs) versus matrix size n for: parallel DTRSV, Superacc, FPE3/FPE4/FPE8 + Superacc, FPE4EE/FPE8EE + Superacc.]
39 LU Factorization Performance Scaling on NVIDIA Tesla K80
A = LU via the unblocked variant: a10^T := a10^T U00^{-1} (trsv), α11 := α11 - a10^T a01 (dot), a12^T := a12^T - a10^T A02 (gemv).
[Figure: time (secs) versus square matrix size m = n for: parallel DLU, Superacc, FPE3/FPE4/FPE8 + Superacc.]
41-44 Conclusions and Future Work
Conclusions:
- Compute results with no errors due to rounding
- Provide bit-wise reproducible results independently of: data permutation and assignment, thread scheduling, partitioning/blocking, and reduction trees
- Deliver performance comparable to classic implementations for memory-bound operations
- Reproducible underlying kernels yield a reproducible LU
Future directions:
- Enhance the compute-intensive operations and the LU factorization
- Cover the other variants of the unblocked LU factorization
- Apply our implementations in real-world codes
45 Thank you for your attention!
URL:
ExBLAS status:
ExBLAS-1: ExSUM (*), ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
(*) Routines shown in blue on the slide are already in ExBLAS.
More informationFast and accurate Bessel function computation
0 Fast and accurate Bessel function computation John Harrison, Intel Corporation ARITH-19 Portland, OR Tue 9th June 2009 (11:00 11:30) 1 Bessel functions and their computation Bessel functions are certain
More informationA Compiler for Linear Algebra Operations
A Compiler for Linear Algebra Operations Paolo Bientinesi In collaboration with Diego Fabregat AICES, RWTH Aachen pauldj@aices.rwth-aachen.de CScADS Autotuning Workshop 2012 August 13-14, 2012 Snowbird,
More informationMultilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota
Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work
More informationScientific Computing: Dense Linear Systems
Scientific Computing: Dense Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 February 9th, 2012 A. Donev (Courant Institute)
More informationOn Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code
On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance
More informationPopulation annealing study of the frustrated Ising antiferromagnet on the stacked triangular lattice
Population annealing study of the frustrated Ising antiferromagnet on the stacked triangular lattice Michal Borovský Department of Theoretical Physics and Astrophysics, University of P. J. Šafárik in Košice,
More informationMagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs
MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,
More informationReview Questions REVIEW QUESTIONS 71
REVIEW QUESTIONS 71 MATLAB, is [42]. For a comprehensive treatment of error analysis and perturbation theory for linear systems and many other problems in linear algebra, see [126, 241]. An overview of
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationMARCH 24-27, 2014 SAN JOSE, CA
MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 212 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationAutomated design of floating-point logarithm functions on integer processors
23rd IEEE Symposium on Computer Arithmetic Santa Clara, CA, USA, 10-13 July 2016 Automated design of floating-point logarithm functions on integer processors Guillaume Revy (presented by Florent de Dinechin)
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationJames Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev
. Accurate and efficient expression evaluation and linear algebra, or Why it can be easier to compute accurate eigenvalues of a Vandermonde matrix than the accurate sum of 3 numbers James Demmel UC Berkeley
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More informationHigh-Performance Small-Scale Solvers for Moving Horizon Estimation
Downloaded from orbit.dtu.dk on: Oct 14, 218 High-Performance Small-Scale Solvers for Moving Horizon Estimation Frison, Gianluca; Vukov, Milan ; Poulsen, Niels Kjølstad; Diehl, Moritz ; Jørgensen, John
More informationHydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign
Hydra: Generation and Tuning of parallel solutions for linear algebra equations Alexandre X. Duchâteau University of Illinois at Urbana Champaign Collaborators Thesis Advisors Denis Barthou (Labri/INRIA
More informationConvergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices
published in J. Comput. Appl. Math., 205(1):533 544, 2007 Convergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices Shin ichi Oishi a,b Kunio Tanabe a Takeshi Ogita b,a Siegfried
More informationHYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017
HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher
More informationINITIAL INTEGRATION AND EVALUATION
INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52
More informationLecture: Numerical Linear Algebra Background
1/36 Lecture: Numerical Linear Algebra Background http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s and Michael Grant s lecture notes
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationACCURATE FLOATING-POINT SUMMATION IN CUB
ACCURATE FLOATING-POINT SUMMATION IN CUB URI VERNER Summe inten OUTLINE Who need accuate floating-point ummation?! Round-off eo: ouce and ecovey A new method fo accuate FP ummation on a GPU Added a a function
More informationTight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu
Tight and rigourous error bounds for basic building blocks of double-word arithmetic Mioara Joldes, Jean-Michel Muller, Valentina Popescu SCAN - September 2016 AMPAR CudA Multiple Precision ARithmetic
More informationAlgorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method
Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk
More informationExtended precision with a rounding mode toward zero environment. Application on the CELL processor
Extended precision with a rounding mode toward zero environment. Application on the CELL processor Hong Diep Nguyen (hong.diep.nguyen@ens-lyon.fr), Stef Graillat (stef.graillat@lip6.fr), Jean-Luc Lamotte
More informationSolving PDEs with CUDA Jonathan Cohen
Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear
More informationAlgorithms for Accurate, Validated and Fast Polynomial Evaluation
Algorithms for Accurate, Validated and Fast Polynomial Evaluation Stef Graillat, Philippe Langlois, Nicolas Louvet Abstract: We survey a class of algorithms to evaluate polynomials with oating point coecients
More informationSaving Energy in Sparse and Dense Linear Algebra Computations
Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain
More informationTensor Contractions with Extended BLAS Kernels on CPU and GPU
Tensor Contractions with Extended BLAS Kernels on CPU and GPU Cris Cecka Senior Research Scientist NVIDIA Research, Santa Clara, California Joint work with Yang Shi, U.N. Niranjan, and Animashree Anandkumar
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationPractical Linear Algebra. Class Notes Spring 2009
Practical Linear Algebra Class Notes Spring 9 Robert van de Geijn Maggie Myers Department of Computer Sciences The University of Texas at Austin Austin, TX 787 Draft January 8, Copyright 9 Robert van de
More information