Towards Fast, Accurate and Reproducible LU Factorization
1 Towards Fast, Accurate and Reproducible LU Factorization
Roman Iakymchuk (1), David Defour (2), and Stef Graillat (3)
(1) KTH Royal Institute of Technology, CSC, CST/PDC
(2) Université de Perpignan, DALI, LIRMM
(3) Sorbonne Universités, UPMC Univ Paris VI, UMR 7606, LIP6
riakymch@kth.se
SCAN 2016, Sept 26-29, 2016, Uppsala, Sweden
Roman Iakymchuk (KTH), Towards Reproducible LU, Sept 26-29, 27 slides
2-3 Linear Algebra Libraries
BLAS-1 [1979]: y := y + αx and α := α + x^T y, with α ∈ R and x, y ∈ R^n; flop-to-memory-access ratio about 2/3
BLAS-2 [1988]: A := A + x y^T and y := A^{-1} x, with A ∈ R^{n×n} and x, y ∈ R^n; ratio about 2
BLAS-3 [1990]: C := C + A B and C := A^{-1} B, with A, B, C ∈ R^{n×n}; ratio about n/2
The Basic Linear Algebra Subprograms (BLAS) underpin higher-level libraries such as LAPACK, FLAME, and NAG; implementations include the reference BLAS, MKL, cuBLAS, OpenBLAS, and ATLAS.
4-5 Goals
Compute BLAS operations on floating-point numbers fast and accurately, ensuring their numerical reproducibility on a wide range of architectures.
ExBLAS (Exact BLAS):
ExBLAS-1: ExSUM, ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
Use the ExBLAS kernels to construct exact higher-level operations such as the LU factorization.
6 Outline
1. Accuracy and Reproducibility of FP Operations
2. Exact Parallel Reduction
3. ExBLAS and Reproducible LU
4. Performance Results
5. Conclusions and Future Work
8-10 Accuracy and Reproducibility Problems
Floating-point arithmetic suffers from rounding errors.
Floating-point operations (+, ×) are commutative but non-associative: (-1 + 1) + 2^-53 ≠ -1 + (1 + 2^-53) in double precision.
Consequence: the result of a floating-point computation depends on the order of the operations.
Results computed by performance-optimized parallel floating-point libraries are often inconsistent: each run may return a different result.
Reproducibility: the ability to obtain bit-wise identical results from run to run on the same input data, on the same or on different architectures.
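The non-associativity example is easy to check in Python, whose floats are IEEE binary64; a minimal sketch:

```python
# (-1 + 1) + 2^-53: the cancellation happens first, so the tiny term survives
left = (-1.0 + 1.0) + 2**-53

# -1 + (1 + 2^-53): 1.0 + 2^-53 is a round-to-even tie and rounds back to 1.0,
# so the tiny term is lost before the cancellation
right = -1.0 + (1.0 + 2**-53)
```

Both expressions sum the same three numbers, yet one yields 2^-53 and the other 0.0, which is exactly why the order of a parallel reduction changes its result.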
11-13 Existing Solutions
Fix the order of computations:
- Sequential mode: intolerably costly on large-scale systems
- Fixed reduction trees: substantial communication overhead
- Example: Intel Conditional Numerical Reproducibility (roughly 2x slowdown, no accuracy guarantees)
Eliminate/reduce the rounding errors:
- Fixed-point arithmetic: limited range of values
- Fixed-size FP expansions with error-free transformations (EFT); example: double-double or quad-double (Briggs, Bailey, Hida, Li), which work well on sets of relatively close numbers
- Infinite precision: reproducible independently of the inputs; example: the Kulisch accumulator (often considered inefficient)
Libraries:
- ReproBLAS, Reproducible BLAS (Demmel, Nguyen, Ahrens): BLAS-1, GEMV, and GEMM on CPUs
- RARE-BLAS, Reproducible, Accurately Rounded and Efficient BLAS (Chohra, Langlois, Parello): BLAS-1 and GEMV on CPUs
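To give a concrete taste of the EFT idea behind double-double and FP expansions, here is classic Kahan compensated summation in Python. It is not one of the libraries above, just the simplest scheme that captures the rounding error of each addition:

```python
def kahan_sum(xs):
    """Compensated summation: c carries the rounding error of each
    addition, recovered with an error-free transformation."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c          # re-inject the previously lost low-order part
        t = s + y
        c = (t - s) - y    # the part of y that s + y just rounded away
        s = t
    return s

# the two 1.0 terms are absorbed by 1e16 in a naive sum, but not here
xs = [1e16, 1.0, 1.0, -1e16]
```

Compensation reduces the error but, unlike a superaccumulator, does not by itself guarantee one bit-wise identical answer for every summation order.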
15 Our Multi-Level Reproducible Summation
A parallel algorithm with five levels, suited to today's parallel architectures.
Based on floating-point expansions (FPE) with error-free transformations and a Kulisch superaccumulator.
Guarantees infinite-precision accumulation and therefore bit-wise reproducibility.
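Python's `math.fsum` computes the correctly rounded sum (via Shewchuk's expansion algorithm), so it can illustrate the guarantee this algorithm provides: one well-defined answer regardless of ordering. A minimal sketch:

```python
import math

# 1e16 has an ulp of 2, so a naive left-to-right sum absorbs the 1.0 terms
# or not, depending on where they appear in the input
a = [1e16, 1.0, 1.0]
b = [1.0, 1.0, 1e16]
naive_a = a[0] + a[1] + a[2]   # both 1.0 terms are lost
naive_b = b[0] + b[1] + b[2]   # 1.0 + 1.0 = 2.0 survives

# a correctly rounded sum is bit-wise identical for every ordering
exact_a = math.fsum(a)
exact_b = math.fsum(b)
```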
16 Level 1: Filtering
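The filtering level accumulates each input into a small floating-point expansion using the TwoSum error-free transformation, flushing to the superaccumulator only what does not fit. A compact Python sketch of this idea (the actual ExBLAS implementation is vectorized and per-thread):

```python
def two_sum(a, b):
    """Knuth's error-free transformation: s + e == a + b exactly."""
    s = a + b
    a_virtual = s - b
    b_virtual = s - a_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

def fpe_accumulate(fpe, x):
    """Add x into a fixed-size floating-point expansion in place; the
    returned residue is what the filtering level would flush onward."""
    for i in range(len(fpe)):
        fpe[i], x = two_sum(fpe[i], x)
    return x

fpe = [0.0, 0.0, 0.0]        # an FPE of size 3 (cf. FPE3 in the plots)
r1 = fpe_accumulate(fpe, 1.0)
r2 = fpe_accumulate(fpe, 2**-60)
```

After the two calls the expansion holds 1.0 and 2^-60 in separate slots, so no information has been rounded away.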
17 Levels 2 and 3: Scalar Superaccumulator
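A Kulisch superaccumulator is a fixed-point register wide enough to hold any double exactly. Python's arbitrary-precision integers can emulate it in a few lines; this is a sketch of the principle only, not the banked binary layout ExBLAS uses:

```python
from fractions import Fraction

SHIFT = 1074  # 2^-1074 is the smallest positive binary64 value

def to_acc(x):
    """Exact fixed-point image of the double x as a Python big integer."""
    v = Fraction(x) * 2**SHIFT   # an integer for every finite double
    return v.numerator

def from_acc(acc):
    """Round the accumulated exact value to the nearest double."""
    return float(Fraction(acc, 2**SHIFT))

# a sum with a huge dynamic range: naive summation loses the 1.0 entirely
xs = [1e300, 1.0, -1e300, 2**-60]
exact = from_acc(sum(to_acc(x) for x in xs))
```

Because the accumulation is exact, the final rounding gives the same bits for every permutation of the inputs.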
18 Levels 4 and 5: Reduction and Rounding
20-25 ExBLAS Highlights
ExBLAS status:
ExBLAS-1: ExSUM, ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
ExSCAL: x := α·x is correctly rounded and reproducible.
Within LU, however, x := (1/α)·x is not correctly rounded: it incurs two roundings, one for the reciprocal and one for the product.
ExInvSCAL: x := x/α is correctly rounded and reproducible.
ExGER, general case: A := α·x·y^T + A.
Within LU, A := x·y^T + A; using FMA, it is correctly rounded and reproducible.
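The ExSCAL vs. ExInvSCAL distinction is easy to observe: multiplying by a rounded reciprocal performs two roundings, while dividing performs one, and the two results differ for many operand pairs. A quick Python check (an illustration only; ExBLAS operates on whole vectors):

```python
# pairs (x, alpha) where (1/alpha)*x and x/alpha round differently
mismatches = [
    (x, a)
    for x in range(1, 100)
    for a in range(1, 100)
    if (1.0 / a) * x != x / a   # two roundings vs. one rounding
]
```

The list is non-empty, which is why the reproducible LU divides by the pivot instead of multiplying by its reciprocal.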
26-28 LU Factorization
Solve Ax = b via the factorization A = LU of an M×N matrix A into a lower triangular factor L and an upper triangular factor U.
[Figures: the blocked algorithm sweeps panels of width q across A, combining a small LU on the current panel, a TRSM to update the block row, and a GEMM update of the trailing submatrix; recursing on the trailing submatrix yields a sequence of such LU/TRSM/GEMM steps.]
29 An Unblocked LU Factorization
Partition A as
[ A00   a01  A02  ]
[ a10^T α11  a12^T ]
[ A20   a21  A22  ]
with A00 the already factored leading square block. The loop body of this variant at step i is:
a10^T := a10^T U00^{-1}     (TRSV)
α11  := α11 - a10^T a01     (DOT)
a12^T := a12^T - a10^T A02  (GEMV)
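The three updates can be prototyped directly in NumPy. This is a plain double-precision sketch of the unblocked variant, without pivoting and without the ExBLAS kernels that make the talk's version reproducible:

```python
import numpy as np

def unblocked_lu(A):
    """Unblocked LU without pivoting: at step i the i-th row is
    updated with a TRSV, a DOT, and a GEMV, as on the slide."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(1, n):
        U00 = np.triu(A[:i, :i])              # the U factor built so far
        # a10^T := a10^T U00^{-1}  (trsv): row of L left of the diagonal
        A[i, :i] = np.linalg.solve(U00.T, A[i, :i])
        # alpha11 := alpha11 - a10^T a01  (dot): new diagonal entry of U
        A[i, i] -= A[i, :i] @ A[:i, i]
        # a12^T := a12^T - a10^T A02  (gemv): new row of U to the right
        A[i, i+1:] -= A[i, :i] @ A[:i, i+1:]
    return A

A = np.array([[4., 3., 2.], [6., 3., 1.], [8., 5., 9.]])
F = unblocked_lu(A)
L = np.tril(F, -1) + np.eye(3)   # unit lower triangular factor
U = np.triu(F)                   # upper triangular factor
```

The factored matrix overwrites A in the usual packed form: L below the unit diagonal, U on and above it.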
31 Parallel Reduction Performance Scaling on Intel Xeon Phi
[Figure: throughput (Gacc/s) versus array size (10^6 to 10^9) for: parallel FP sum, Demmel fast, TBB deterministic, Superacc, FPE2 + Superacc, FPE3 + Superacc, FPE4 + Superacc, FPE8EE + Superacc.]
32 Parallel Reduction Data-Dependent Performance on NVIDIA Tesla K20c
[Figure: throughput (Gacc/s) versus dynamic range (10^20 to 10^140) of the inputs, n = 67·10^6, for: parallel FP sum, Demmel fast, Superacc, FPE2/FPE3/FPE4/FPE8/FPE8EE + Superacc.]
33 Dot Product Performance Scaling on NVIDIA Tesla K20c
DDOT: α := x^T y = Σ_{i=1}^{N} x_i y_i
Based on TwoProduct and the reproducible summation:
TwoProduct(a, b):
  1: r ← a × b
  2: s ← fma(a, b, -r)
where fma(a, b, c) computes a × b + c with a single rounding, so s is the exact rounding error of the product.
[Figure: time (secs) versus array size (10^6 to 10^9) for: parallel DDOT, Superacc, FPE3/FPE4/FPE8/FPE8EE + Superacc.]
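TwoProduct as written relies on the hardware FMA. Python has no portable fma, but Dekker's splitting yields the same error-free transformation, with r + e equal to a·b exactly (barring over/underflow); a sketch:

```python
from fractions import Fraction

SPLIT_FACTOR = 2.0**27 + 1.0  # Dekker's splitter for binary64

def split(a):
    """Split a into a_hi + a_lo, each fitting in half a significand."""
    c = SPLIT_FACTOR * a
    a_hi = c - (c - a)
    return a_hi, a - a_hi

def two_product(a, b):
    """Error-free transformation of a product: returns (r, e) with
    r = fl(a*b) and r + e = a*b exactly (barring over/underflow)."""
    r = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = a_lo * b_lo - (((r - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return r, e

r, e = two_product(0.1, 0.3)
```

The exactness can be verified with rational arithmetic: Fraction(r) + Fraction(e) equals the true product of the two doubles.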
34 Matrix-Vector Product Performance Scaling on NVIDIA Tesla K80
DGEMV: y := αAx + βy
[Figure: the m×n matrix A is partitioned into blocks of mb rows; each block is multiplied with x and accumulated into the corresponding part of y.]
35 [Figure: time (secs) versus square matrix size m = n for: parallel DGEMV, Superacc, FPE3/FPE4/FPE8/FPE8EE + Superacc.]
36 [Figure: time (secs) versus n for fixed m = 256, same variants.]
37 Triangular Solver Matrix Partitioning
[Figure: a lower triangular matrix L is partitioned into bs×bs diagonal blocks, each solved by a TRSV assigned to a work-group (wg0 to wg3), while the off-diagonal blocks are applied as GEMV updates.]
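In plain NumPy the same partitioning reads as follows: each diagonal block is a small TRSV, and the block just solved updates the remaining right-hand side with a GEMV. This is a sequential sketch of the data flow, not the GPU work-group scheme:

```python
import numpy as np

def blocked_trsv_lower(L, b, bs):
    """Blocked forward substitution for L x = b (L lower triangular):
    a TRSV per bs-by-bs diagonal block, then a GEMV update of the
    still-unsolved part of the right-hand side."""
    x = b.astype(float).copy()
    n = L.shape[0]
    for k in range(0, n, bs):
        e = min(k + bs, n)
        x[k:e] = np.linalg.solve(L[k:e, k:e], x[k:e])  # TRSV on the block
        if e < n:
            x[e:] -= L[e:, k:e] @ x[k:e]               # GEMV update
    return x

L = np.array([[2., 0, 0, 0], [1, 3, 0, 0], [4, 1, 5, 0], [2, 2, 2, 2]])
b = np.array([2., 5., 14., 16.])
x = blocked_trsv_lower(L, b, bs=2)
```

In ExTRSV the inner solves and updates are replaced by ExDOT and ExGEMV, which is what makes the whole solver reproducible.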
38 Triangular Solver Performance Scaling on NVIDIA Tesla K20c
DTRSV: Ax = b
Blocked ExTRSV: based on ExDOT, with an internal ExGEMV.
[Figure: time (secs) versus matrix size n for: parallel DTRSV, Superacc, FPE3/FPE4/FPE8 + Superacc, FPE4EE/FPE8EE + Superacc.]
39 LU Factorization Performance Scaling on NVIDIA Tesla K80
A = LU via the unblocked variant: a10^T := a10^T U00^{-1} (trsv), α11 := α11 - a10^T a01 (dot), a12^T := a12^T - a10^T A02 (gemv).
[Figure: time (secs) versus square matrix size m = n for: parallel DLU, Superacc, FPE3/FPE4/FPE8 + Superacc.]
41-44 Conclusions and Future Work
Conclusions:
- Compute results with no errors due to rounding
- Provide bit-wise reproducible results independently of: data permutation and assignment, thread scheduling, partitioning/blocking, and reduction trees
- Deliver performance comparable to classic implementations for memory-bound operations
- Reproducible underlying kernels yield a reproducible LU
Future directions:
- Enhance the compute-intensive operations and the LU factorization
- Cover the other variants of the unblocked LU factorization
- Apply our implementations in real-world codes
45 Thank you for your attention!
URL:
ExBLAS status:
ExBLAS-1: ExSUM (*), ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
(*) Routines shown in blue on the slide are already in ExBLAS.
More informationFast and accurate Bessel function computation
0 Fast and accurate Bessel function computation John Harrison, Intel Corporation ARITH-19 Portland, OR Tue 9th June 2009 (11:00 11:30) 1 Bessel functions and their computation Bessel functions are certain
More informationA Compiler for Linear Algebra Operations
A Compiler for Linear Algebra Operations Paolo Bientinesi In collaboration with Diego Fabregat AICES, RWTH Aachen pauldj@aices.rwth-aachen.de CScADS Autotuning Workshop 2012 August 13-14, 2012 Snowbird,
More informationMultilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota
Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work
More informationScientific Computing: Dense Linear Systems
Scientific Computing: Dense Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 February 9th, 2012 A. Donev (Courant Institute)
More informationOn Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code
On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance
More informationPopulation annealing study of the frustrated Ising antiferromagnet on the stacked triangular lattice
Population annealing study of the frustrated Ising antiferromagnet on the stacked triangular lattice Michal Borovský Department of Theoretical Physics and Astrophysics, University of P. J. Šafárik in Košice,
More informationMagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs
MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,
More informationReview Questions REVIEW QUESTIONS 71
REVIEW QUESTIONS 71 MATLAB, is [42]. For a comprehensive treatment of error analysis and perturbation theory for linear systems and many other problems in linear algebra, see [126, 241]. An overview of
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationMARCH 24-27, 2014 SAN JOSE, CA
MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 212 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationAutomated design of floating-point logarithm functions on integer processors
23rd IEEE Symposium on Computer Arithmetic Santa Clara, CA, USA, 10-13 July 2016 Automated design of floating-point logarithm functions on integer processors Guillaume Revy (presented by Florent de Dinechin)
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationJames Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev
. Accurate and efficient expression evaluation and linear algebra, or Why it can be easier to compute accurate eigenvalues of a Vandermonde matrix than the accurate sum of 3 numbers James Demmel UC Berkeley
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More informationHigh-Performance Small-Scale Solvers for Moving Horizon Estimation
Downloaded from orbit.dtu.dk on: Oct 14, 218 High-Performance Small-Scale Solvers for Moving Horizon Estimation Frison, Gianluca; Vukov, Milan ; Poulsen, Niels Kjølstad; Diehl, Moritz ; Jørgensen, John
More informationHydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign
Hydra: Generation and Tuning of parallel solutions for linear algebra equations Alexandre X. Duchâteau University of Illinois at Urbana Champaign Collaborators Thesis Advisors Denis Barthou (Labri/INRIA
More informationConvergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices
published in J. Comput. Appl. Math., 205(1):533 544, 2007 Convergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices Shin ichi Oishi a,b Kunio Tanabe a Takeshi Ogita b,a Siegfried
More informationHYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017
HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher
More informationINITIAL INTEGRATION AND EVALUATION
INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52
More informationLecture: Numerical Linear Algebra Background
1/36 Lecture: Numerical Linear Algebra Background http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s and Michael Grant s lecture notes
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationACCURATE FLOATING-POINT SUMMATION IN CUB
ACCURATE FLOATING-POINT SUMMATION IN CUB URI VERNER Summe inten OUTLINE Who need accuate floating-point ummation?! Round-off eo: ouce and ecovey A new method fo accuate FP ummation on a GPU Added a a function
More informationTight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu
Tight and rigourous error bounds for basic building blocks of double-word arithmetic Mioara Joldes, Jean-Michel Muller, Valentina Popescu SCAN - September 2016 AMPAR CudA Multiple Precision ARithmetic
More informationAlgorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method
Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk
More informationExtended precision with a rounding mode toward zero environment. Application on the CELL processor
Extended precision with a rounding mode toward zero environment. Application on the CELL processor Hong Diep Nguyen (hong.diep.nguyen@ens-lyon.fr), Stef Graillat (stef.graillat@lip6.fr), Jean-Luc Lamotte
More informationSolving PDEs with CUDA Jonathan Cohen
Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear
More informationAlgorithms for Accurate, Validated and Fast Polynomial Evaluation
Algorithms for Accurate, Validated and Fast Polynomial Evaluation Stef Graillat, Philippe Langlois, Nicolas Louvet Abstract: We survey a class of algorithms to evaluate polynomials with oating point coecients
More informationSaving Energy in Sparse and Dense Linear Algebra Computations
Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain
More informationTensor Contractions with Extended BLAS Kernels on CPU and GPU
Tensor Contractions with Extended BLAS Kernels on CPU and GPU Cris Cecka Senior Research Scientist NVIDIA Research, Santa Clara, California Joint work with Yang Shi, U.N. Niranjan, and Animashree Anandkumar
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationPractical Linear Algebra. Class Notes Spring 2009
Practical Linear Algebra Class Notes Spring 9 Robert van de Geijn Maggie Myers Department of Computer Sciences The University of Texas at Austin Austin, TX 787 Draft January 8, Copyright 9 Robert van de
More information