Towards Fast, Accurate and Reproducible LU Factorization


1 Towards Fast, Accurate and Reproducible LU Factorization
Roman Iakymchuk (1), David Defour (2), and Stef Graillat (3)
(1) KTH Royal Institute of Technology, CSC, CST/PDC
(2) Université de Perpignan, DALI LIRMM
(3) Sorbonne Universités, UPMC Univ Paris VI, UMR 7606, LIP6
riakymch@kth.se
SCAN 2016, Sept 26-29, 2016, Uppsala, Sweden
Roman Iakymchuk (KTH), Towards Reproducible LU, Sept 26-29, 2016 (27 slides)

3 Linear Algebra Libraries
BLAS-1 [1979]: y := y + α x, with α ∈ R; x, y ∈ R^n; flops-to-memory ratio 2/3. Example: α := α + x^T y
BLAS-2 [1988]: A := A + x y^T, with A ∈ R^{n×n}; x, y ∈ R^n; ratio 2. Example: y := A^{-1} x
BLAS-3 [1990]: C := C + A B, with A, B, C ∈ R^{n×n}; ratio n/2. Example: C := A^{-1} B
Software stack: LAPACK, FLAME, and NAG build on the Basic Linear Algebra Subprograms (BLAS), with implementations such as the reference BLAS, MKL, cuBLAS, OpenBLAS, and ATLAS.

5 Goals
Compute BLAS operations on floating-point numbers fast and accurately, ensuring their numerical reproducibility on a wide range of architectures.
ExBLAS (Exact BLAS):
ExBLAS-1: ExSUM, ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
Use the ExBLAS kernels to construct exact higher-level operations such as the LU factorization.

6 Outline
1 Accuracy and Reproducibility of FP Operations
2 Exact Parallel Reduction
3 ExBLAS and Reproducible LU
4 Performance Results
5 Conclusions and Future Work

10 Accuracy and Reproducibility Problems
Floating-point arithmetic suffers from rounding errors.
Floating-point operations (+, ×) are commutative but non-associative: in double precision, (-1 + 1) + 2^-53 ≠ -1 + (1 + 2^-53).
Consequence: the result of a floating-point computation depends on the order of evaluation.
Results computed by performance-optimized parallel floating-point libraries are often inconsistent: each run may return a different result.
Reproducibility: the ability to obtain bit-wise identical results from run to run, on the same input data, on the same or different architectures.
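The non-associativity above is easy to observe directly; a minimal Python sketch (illustrative, not from the talk):

```python
# The same three summands give different results in a different order.
a, b, c = -1.0, 1.0, 2.0**-53

left = (a + b) + c    # (-1 + 1) + 2^-53 = 2^-53
right = a + (b + c)   # 1 + 2^-53 rounds to 1.0 (ties to even), so the sum is 0.0

print(left, right)    # 1.1102230246251565e-16 0.0
assert left != right
```

Any parallel reduction that changes the association of the summands can therefore change the final bits of the result.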

13 Existing Solutions
Fix the order of computations:
- Sequential mode: intolerably costly on large-scale systems
- Fixed reduction trees: substantial communication overhead
- Example: Intel Conditional Numerical Reproducibility (roughly 2x overhead, no accuracy guarantees)
Eliminate or reduce the rounding errors:
- Fixed-point arithmetic: limited range of values
- Fixed-length FP expansions with error-free transformations (EFT); example: double-double or quad-double (Briggs, Bailey, Hida, Li); these work well on sets of relatively close numbers
- Infinite precision: reproducible independently of the inputs; example: the Kulisch accumulator (often considered inefficient)
Libraries:
- ReproBLAS: Reproducible BLAS (Demmel, Nguyen, Ahrens), covering BLAS-1, GEMV, and GEMM on CPUs
- RARE-BLAS: Reproducible, Accurately Rounded and Efficient BLAS (Chohra, Langlois, Parello), covering BLAS-1 and GEMV on CPUs
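The error-free transformations mentioned above are short branch-free sequences; a sketch of Knuth's TwoSum (6 flops), the building block of double-double arithmetic and FP expansions (illustrative, not ExBLAS code):

```python
def two_sum(a, b):
    """Knuth's error-free addition: returns (s, e) with
    s = fl(a + b) and s + e = a + b exactly (IEEE doubles)."""
    s = a + b
    bp = s - a                        # the part of b actually absorbed into s
    e = (a - (s - bp)) + (b - bp)     # what rounding threw away
    return s, e

s, e = two_sum(1.0, 2.0**-53)
# s is the rounded sum 1.0; e recovers the lost bit 2**-53
```

The pair (s, e) carries the exact sum, which is what makes fixed-length expansions exact as long as no term is dropped.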

15 Our Multi-Level Reproducible Summation
- A parallel algorithm with 5 levels
- Suitable for today's parallel architectures
- Based on FP expansions with EFT and a Kulisch accumulator
- Guarantees infinite precision, hence bit-wise reproducibility
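The FPE-plus-superaccumulator idea can be sketched in plain Python. Here exact rational arithmetic stands in for the Kulisch accumulator, and the expansion size and spill policy are simplifications of the actual ExSUM algorithm:

```python
from fractions import Fraction

def two_sum(a, b):
    # Knuth's error-free addition: s + e == a + b exactly
    s = a + b
    bp = s - a
    return s, (a - (s - bp)) + (b - bp)

def reproducible_sum(values, fpe_size=2):
    """Accumulate into a small floating-point expansion via chained TwoSum;
    any residue that does not fit is flushed to an exact (Kulisch-like)
    accumulator, so the final result is independent of the summation order."""
    fpe = [0.0] * fpe_size
    superacc = Fraction(0)
    for x in values:
        for i in range(fpe_size):
            fpe[i], x = two_sum(fpe[i], x)
        if x != 0.0:                  # residue did not fit in the expansion
            superacc += Fraction(x)
    for t in fpe:                     # fold the expansion into the accumulator
        superacc += Fraction(t)
    return float(superacc)            # one final correctly rounded conversion
```

For example, `reproducible_sum([1e16, 1.0, -1e16])` returns the exact 1.0 in any input order, whereas naive left-to-right summation returns 0.0.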

16 Level 1: Filtering (figure)

17 Level 2 and 3: Scalar Superaccumulator (figure)

18 Level 4 and 5: Reduction and Rounding (figure)

25 ExBLAS Highlights
ExBLAS status:
ExBLAS-1: ExSUM, ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
ExSCAL:
- x := α x: correctly rounded and reproducible
- Within LU: x := (1/α) x is not correctly rounded, since 1/α is rounded before the multiplication
- ExInvSCAL: x := x/α: correctly rounded and reproducible
ExGER:
- General case: A := α x y^T + A
- Within LU: A := x y^T + A; using FMA, correctly rounded and reproducible
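The ExSCAL/ExInvSCAL distinction is visible already with scalars: x/α is one correctly rounded IEEE division, while (1/α)·x commits two roundings. A small illustration (the search range is an assumption for demonstration; not from the talk):

```python
# a / a is one exact division (always 1.0), while (1.0 / a) * a first rounds
# 1/a and then rounds the product: two roundings, which can differ from one.
mismatches = [a for a in range(1, 100) if (1.0 / a) * a != a / a]
print(mismatches)  # integers a for which fl(fl(1/a) * a) != fl(a/a)
assert mismatches, "two roundings differ from one for some alpha"
```

This is exactly why the LU code uses x := x/α (ExInvSCAL) rather than precomputing the reciprocal.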

26 LU Factorization: solve Ax = b via A = LU (figure: an M×N matrix A factored into lower triangular L and upper triangular U)

27 LU Factorization: A = LU (figure: blocked algorithm; at each step, LU on the current panel, then TRSM and GEMM updates on the trailing matrix)

28 LU Factorization: A = LU (figure: the blocked algorithm sweeps panel by panel, repeating the LU, TRSM, and GEMM steps across the matrix)

29 An Unblocked LU Factorization
At step i, with A partitioned as
  A00   a01   A02
  a10^T α11   a12^T
  A20   a21   A22
the factorization updates:
  a10^T := a10^T U00^{-1}   (TRSV)
  α11 := α11 - a10^T a01    (DOT)
  a12^T := a12^T - a10^T A02   (GEMV)
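The three kernel calls above can be written as a plain Python reference (no pivoting; the L factor, with unit diagonal, ends up below the diagonal of A; illustrative only, not the GPU implementation):

```python
def unblocked_lu(A):
    """In-place LU of a square matrix, row by row: row i gets its L part
    via a TRSV against U00, then its U part via a DOT (diagonal entry)
    and a GEMV (trailing row).  No pivoting."""
    n = len(A)
    for i in range(n):
        # a10^T := a10^T U00^{-1}  (TRSV): solve for the L entries in row i
        for j in range(i):
            A[i][j] = (A[i][j] - sum(A[i][k] * A[k][j] for k in range(j))) / A[j][j]
        # alpha11 := alpha11 - a10^T a01  (DOT)
        A[i][i] -= sum(A[i][k] * A[k][i] for k in range(i))
        # a12^T := a12^T - a10^T A02  (GEMV)
        for j in range(i + 1, n):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(i))
    return A
```

Taking L unit lower triangular and U upper triangular from the result, L·U reproduces the input (exactly so for inputs whose arithmetic incurs no rounding).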

31 Parallel Reduction: Performance Scaling on Intel Xeon Phi (plot: throughput in Gacc/s vs. array size from 1e6 to 1e9, comparing parallel FP sum, Demmel fast, TBB deterministic, Superacc, and FPE2/FPE3/FPE4/FPE8EE + Superacc)

32 Parallel Reduction: Data-Dependent Performance on NVIDIA Tesla K20c, n = 67e06 (plot: throughput in Gacc/s vs. dynamic range of the input up to 1e140, comparing parallel FP sum, Demmel fast, Superacc, and FPE2/FPE3/FPE4/FPE8/FPE8EE + Superacc)

33 Dot Product: Performance Scaling on NVIDIA Tesla K20c
DDOT: α := x^T y = Σ_{i=1}^{N} x_i y_i
Based on TwoProduct and reproducible summation:
TwoProduct(a, b):
  1: r := a × b
  2: s := fma(a, b, -r), where fma(a, b, c) = a × b + c with a single rounding
(plot: time in seconds vs. array size from 1e6 to 1e9, comparing parallel DDOT, Superacc, and FPE3/FPE4/FPE8/FPE8EE + Superacc)
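Where a hardware FMA is not exposed (Python only gained math.fma in 3.13), the same error-free product can be obtained with Veltkamp splitting and Dekker's product; a sketch:

```python
def two_product(a, b):
    """Veltkamp/Dekker error-free product: returns (p, e) with p = fl(a*b)
    and p + e = a*b exactly (IEEE doubles, barring overflow/underflow)."""
    def split(x):
        # Veltkamp splitting with factor 2**27 + 1: x = hi + lo,
        # each half carrying at most ~26-27 significant bits
        t = 134217729.0 * x
        hi = t - (t - x)
        return hi, x - hi
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

p, e = two_product(1.0 + 2.0**-27, 1.0 - 2.0**-28)
# p + e recovers the exact product: here p = 1 + 2**-28 and e = -2**-55
```

With an FMA, lines computing the splits collapse to the two-instruction version shown on the slide.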

34 Matrix-Vector Product: Performance Scaling on NVIDIA Tesla K80. DGEMV: y := α A x + β y (figure: partitioning of the m×n matrix A, x, and y into blocks of mb rows for the blocked algorithm)

35 Matrix-Vector Product: Performance Scaling on NVIDIA Tesla K80. DGEMV: y := α A x + β y (plot: time in seconds vs. square matrix size m = n, comparing parallel DGEMV, Superacc, and FPE3/FPE4/FPE8/FPE8EE + Superacc)

36 Matrix-Vector Product: Performance Scaling on NVIDIA Tesla K80. DGEMV: y := α A x + β y (plot: time in seconds vs. matrix size n with m = 256, comparing parallel DGEMV, Superacc, and FPE3/FPE4/FPE8/FPE8EE + Superacc)

37 Triangular Solver: Matrix Partitioning
(figure: a lower triangular matrix L partitioned into blocks of size bs; each diagonal block is solved with TRSV by one work-group wg0-wg3, and the off-diagonal blocks update the right-hand side via GEMV)
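The partitioning above amounts to blocked forward substitution: TRSV on each diagonal block, then a GEMV update of the remaining right-hand side. A plain Python sketch of that structure (sequential, illustrative; the block size bs is a parameter as on the slide):

```python
def blocked_trsv_lower(L, b, bs=2):
    """Solve L x = b for lower-triangular L (non-unit diagonal) by blocks:
    TRSV on each bs x bs diagonal block, then a GEMV update that subtracts
    the just-computed unknowns from the rest of the right-hand side."""
    n = len(b)
    x = list(b)
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        # TRSV on the diagonal block L[k0:k1, k0:k1]
        for i in range(k0, k1):
            s = sum(L[i][j] * x[j] for j in range(k0, i))
            x[i] = (x[i] - s) / L[i][i]
        # GEMV update: x[k1:] -= L[k1:, k0:k1] @ x[k0:k1]
        for i in range(k1, n):
            x[i] -= sum(L[i][j] * x[j] for j in range(k0, k1))
    return x
```

In the GPU version each diagonal TRSV is handled by one work-group while the GEMV updates expose the bulk of the parallelism.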

38 Triangular Solver: Performance Scaling on NVIDIA Tesla K20c
DTRSV: solve Ax = b
Blocked ExTRSV: based on ExDOT, with an internal ExGEMV
(plot: time in seconds vs. matrix size n, comparing parallel DTRSV, Superacc, and FPE3/FPE4/FPE8/FPE4EE/FPE8EE + Superacc)

39 LU Factorization: Performance Scaling on NVIDIA Tesla K80
A = LU via the unblocked algorithm:
  a10^T := a10^T U00^{-1} (trsv); α11 := α11 - a10^T a01 (dot); a12^T := a12^T - a10^T A02 (gemv)
(plot: time in seconds vs. square matrix size m = n, comparing parallel DLU, Superacc, and FPE3/FPE4/FPE8 + Superacc)

44 Conclusions and Future Work
Conclusions:
- Compute results with no errors due to rounding
- Provide bit-wise reproducible results independently of: data permutation and assignment, thread scheduling, partitioning/blocking, reduction trees
- Deliver performance comparable to classic implementations of memory-bound operations
- Reproducible underlying kernels yield a reproducible LU
Future directions:
- Enhance compute-intensive operations and the LU factorization
- Cover the other variants of the unblocked LU factorization
- Apply our implementations in real-world codes

45 Thank you for your attention!
URL:
ExBLAS status:
ExBLAS-1: ExSUM, ExSCAL, ExDOT, ExAXPY, ...
ExBLAS-2: ExGER, ExGEMV, ExTRSV, ExSYR, ...
ExBLAS-3: ExGEMM, ExTRSM, ExSYR2K, ...
(In the original slide, routines shown in blue are already available in ExBLAS.)


More information

6 Linear Systems of Equations

6 Linear Systems of Equations 6 Linear Systems of Equations Read sections 2.1 2.3, 2.4.1 2.4.5, 2.4.7, 2.7 Review questions 2.1 2.37, 2.43 2.67 6.1 Introduction When numerically solving two-point boundary value problems, the differential

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria 1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation

More information

HPMPC - A new software package with efficient solvers for Model Predictive Control

HPMPC - A new software package with efficient solvers for Model Predictive Control - A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model

More information

Compensated Horner Scheme

Compensated Horner Scheme Compensated Horner Scheme S. Graillat 1, Philippe Langlois 1, Nicolas Louvet 1 DALI-LP2A Laboratory. Université de Perpignan Via Domitia. firstname.lastname@univ-perp.fr, http://webdali.univ-perp.fr Abstract.

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem

A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem Abid Rafique, Nachiket Kapre, and George A. Constantinides Electrical and Electronic Engineering,

More information

The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors

The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors Aachen Institute for Advanced Study in Computational Engineering Science Preprint: AICES-2010/09-4 23/September/2010 The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 2 Systems of Linear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

REPRESENTATIONS FOR THE THEORY AND PRACTICE OF HIGH-PERFORMANCE DENSE LINEAR ALGEBRA ALGORITHMS. DRAFT October 23, 2007

REPRESENTATIONS FOR THE THEORY AND PRACTICE OF HIGH-PERFORMANCE DENSE LINEAR ALGEBRA ALGORITHMS. DRAFT October 23, 2007 REPRESENTATIONS FOR THE THEORY AND PRACTICE OF HIGH-PERFORMANCE DENSE LINEAR ALGEBRA ALGORITHMS ROBERT A VAN DE GEIJN DRAFT October 23, 2007 Abstract The Cholesky factorization operation is used to demonstrate

More information

Fast and accurate Bessel function computation

Fast and accurate Bessel function computation 0 Fast and accurate Bessel function computation John Harrison, Intel Corporation ARITH-19 Portland, OR Tue 9th June 2009 (11:00 11:30) 1 Bessel functions and their computation Bessel functions are certain

More information

A Compiler for Linear Algebra Operations

A Compiler for Linear Algebra Operations A Compiler for Linear Algebra Operations Paolo Bientinesi In collaboration with Diego Fabregat AICES, RWTH Aachen pauldj@aices.rwth-aachen.de CScADS Autotuning Workshop 2012 August 13-14, 2012 Snowbird,

More information

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work

More information

Scientific Computing: Dense Linear Systems

Scientific Computing: Dense Linear Systems Scientific Computing: Dense Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 February 9th, 2012 A. Donev (Courant Institute)

More information

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance

More information

Population annealing study of the frustrated Ising antiferromagnet on the stacked triangular lattice

Population annealing study of the frustrated Ising antiferromagnet on the stacked triangular lattice Population annealing study of the frustrated Ising antiferromagnet on the stacked triangular lattice Michal Borovský Department of Theoretical Physics and Astrophysics, University of P. J. Šafárik in Košice,

More information

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,

More information

Review Questions REVIEW QUESTIONS 71

Review Questions REVIEW QUESTIONS 71 REVIEW QUESTIONS 71 MATLAB, is [42]. For a comprehensive treatment of error analysis and perturbation theory for linear systems and many other problems in linear algebra, see [126, 241]. An overview of

More information

Sparse BLAS-3 Reduction

Sparse BLAS-3 Reduction Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint: Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 212 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University

More information

Automated design of floating-point logarithm functions on integer processors

Automated design of floating-point logarithm functions on integer processors 23rd IEEE Symposium on Computer Arithmetic Santa Clara, CA, USA, 10-13 July 2016 Automated design of floating-point logarithm functions on integer processors Guillaume Revy (presented by Florent de Dinechin)

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

James Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev

James Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev . Accurate and efficient expression evaluation and linear algebra, or Why it can be easier to compute accurate eigenvalues of a Vandermonde matrix than the accurate sum of 3 numbers James Demmel UC Berkeley

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

High-Performance Small-Scale Solvers for Moving Horizon Estimation

High-Performance Small-Scale Solvers for Moving Horizon Estimation Downloaded from orbit.dtu.dk on: Oct 14, 218 High-Performance Small-Scale Solvers for Moving Horizon Estimation Frison, Gianluca; Vukov, Milan ; Poulsen, Niels Kjølstad; Diehl, Moritz ; Jørgensen, John

More information

Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign

Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign Hydra: Generation and Tuning of parallel solutions for linear algebra equations Alexandre X. Duchâteau University of Illinois at Urbana Champaign Collaborators Thesis Advisors Denis Barthou (Labri/INRIA

More information

Convergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices

Convergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices published in J. Comput. Appl. Math., 205(1):533 544, 2007 Convergence of Rump s Method for Inverting Arbitrarily Ill-Conditioned Matrices Shin ichi Oishi a,b Kunio Tanabe a Takeshi Ogita b,a Siegfried

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

INITIAL INTEGRATION AND EVALUATION

INITIAL INTEGRATION AND EVALUATION INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52

More information

Lecture: Numerical Linear Algebra Background

Lecture: Numerical Linear Algebra Background 1/36 Lecture: Numerical Linear Algebra Background http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s and Michael Grant s lecture notes

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

ACCURATE FLOATING-POINT SUMMATION IN CUB

ACCURATE FLOATING-POINT SUMMATION IN CUB ACCURATE FLOATING-POINT SUMMATION IN CUB URI VERNER Summe inten OUTLINE Who need accuate floating-point ummation?! Round-off eo: ouce and ecovey A new method fo accuate FP ummation on a GPU Added a a function

More information

Tight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu

Tight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu Tight and rigourous error bounds for basic building blocks of double-word arithmetic Mioara Joldes, Jean-Michel Muller, Valentina Popescu SCAN - September 2016 AMPAR CudA Multiple Precision ARithmetic

More information

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk

More information

Extended precision with a rounding mode toward zero environment. Application on the CELL processor

Extended precision with a rounding mode toward zero environment. Application on the CELL processor Extended precision with a rounding mode toward zero environment. Application on the CELL processor Hong Diep Nguyen (hong.diep.nguyen@ens-lyon.fr), Stef Graillat (stef.graillat@lip6.fr), Jean-Luc Lamotte

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

Algorithms for Accurate, Validated and Fast Polynomial Evaluation

Algorithms for Accurate, Validated and Fast Polynomial Evaluation Algorithms for Accurate, Validated and Fast Polynomial Evaluation Stef Graillat, Philippe Langlois, Nicolas Louvet Abstract: We survey a class of algorithms to evaluate polynomials with oating point coecients

More information

Saving Energy in Sparse and Dense Linear Algebra Computations

Saving Energy in Sparse and Dense Linear Algebra Computations Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain

More information

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Tensor Contractions with Extended BLAS Kernels on CPU and GPU Tensor Contractions with Extended BLAS Kernels on CPU and GPU Cris Cecka Senior Research Scientist NVIDIA Research, Santa Clara, California Joint work with Yang Shi, U.N. Niranjan, and Animashree Anandkumar

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS

More information

Practical Linear Algebra. Class Notes Spring 2009

Practical Linear Algebra. Class Notes Spring 2009 Practical Linear Algebra Class Notes Spring 9 Robert van de Geijn Maggie Myers Department of Computer Sciences The University of Texas at Austin Austin, TX 787 Draft January 8, Copyright 9 Robert van de

More information