Level-3 BLAS on a GPU


Level-3 BLAS on a GPU: Picking the Low Hanging Fruit
Francisco Igual (1), Gregorio Quintana-Ortí (1), Robert A. van de Geijn (2)
(1) Departamento de Ingeniería y Ciencia de los Computadores, Universidad Jaume I, Castellón (Spain)
(2) Department of Computer Sciences, The University of Texas at Austin

Motivation (I)
GPU vendors promise spectacular peak performances.

Motivation (II)
But real performances are not so optimistic.
[Figure: CUBLAS 2.3 GEMM performance on a T10, GFLOPS vs. matrix size, against the peak performance of the card]
Limitations of the graphics-oriented architecture.

Motivation (III)
And current implementations can be quite poor.
[Figure: performance of CUBLAS 2.3 GEMM, SYR2K and SYMM on a T10, GFLOPS vs. matrix size, against peak performance]
The GPU is hard to program efficiently, even using CUDA.

Outline
1 Introduction
2 Development of algorithms by blocks. The matrix-matrix product
3 Accelerating the Level-3 CUBLAS
4 Experimental results
5 Conclusions and future work

Section 1: Introduction

Introduction
BLAS: Basic Linear Algebra Subprograms
- Lie at the heart of complex dense linear algebra algorithms
- Are key to their final performance
- Tuned implementations exist for many architectures: GotoBLAS, MKL, CUBLAS
Goals
- Tune the performance of the latest implementation of CUBLAS
- Without low-level programming (CUDA)
- Improving programmability: the FLAME methodology

Section 2: Development of algorithms by blocks. The matrix-matrix product

The FLAME methodology
FLAME provides a high-level abstraction and notation for dense linear algebra algorithms. It is not only a library:
- A notation for expressing algorithms
- A methodology for the systematic derivation of algorithms
- Application Program Interfaces (APIs) for representing the algorithms in code
- Tools and more
Example: the matrix-matrix multiplication.

The matrix-matrix multiplication
(a) Partitioning before the iteration: C -> ( C0 | C1 | C2 ) and B -> ( B0 | B1 | B2 ), with A left unpartitioned.
(b) Computation in the iteration: C1 += A B1.
(c) Advancement of the partitioning for the next iteration: the C1 and B1 panels move one block to the right.

The matrix-matrix multiplication. FLAME algorithm

    Algorithm: GEMM_MP(A, B, C)
    Partition B -> ( BL | BR ), C -> ( CL | CR )
      where BL has 0 columns and CL has 0 columns
    while n(BL) < n(B) do
      Determine block size b
      Repartition ( BL | BR ) -> ( B0 | B1 | B2 ),
                  ( CL | CR ) -> ( C0 | C1 | C2 )
        where B1 has b columns and C1 has b columns
      C1 := C1 + A B1
      Continue with ( BL | BR ) <- ( B0 B1 | B2 ),
                    ( CL | CR ) <- ( C0 C1 | C2 )
    endwhile

The matrix-matrix multiplication. FLAME code

    FLA_Obj BL, BR, B0, B1, B2;
    FLA_Obj CL, CR, C0, C1, C2;
    int     b;

    FLA_Part_1x2( B,    &BL, &BR,    0, FLA_LEFT );
    FLA_Part_1x2( C,    &CL, &CR,    0, FLA_LEFT );

    while ( FLA_Obj_width( BL ) < FLA_Obj_width( B ) ) {
      b = min( FLA_Obj_width( BR ), nb_alg );

      FLA_Repart_1x2_to_1x3( BL, /**/ BR,    &B0, /**/ &B1, &B2,
                             b, FLA_RIGHT );
      FLA_Repart_1x2_to_1x3( CL, /**/ CR,    &C0, /**/ &C1, &C2,
                             b, FLA_RIGHT );
      /*------------------------------------------------------------*/
      /* C1 := C1 + A * B1 */
      FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_ONE, A, B1, FLA_ONE, C1 );
      /*------------------------------------------------------------*/
      FLA_Cont_with_1x3_to_1x2( &BL, /**/ &BR,    B0, B1, /**/ B2,
                                FLA_LEFT );
      FLA_Cont_with_1x3_to_1x2( &CL, /**/ &CR,    C0, C1, /**/ C2,
                                FLA_LEFT );
    }
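For readers less familiar with the FLAME API, the loop above corresponds to the following index-based C sketch. This is our own minimal illustration, not taken from the slides; it assumes single precision, column-major storage, and uses a naive reference kernel in place of a tuned GEMM:

    /* Reference kernel: C(m x n) += A(m x k) * B(k x n), column-major. */
    void gemm_ref(int m, int n, int k,
                  const float *A, int lda,
                  const float *B, int ldb,
                  float *C, int ldc) {
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                for (int i = 0; i < m; i++)
                    C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
    }

    /* GEMM_MP: traverse B and C in panels of nb columns;
       each iteration performs C1 := C1 + A * B1. */
    void gemm_mp(int m, int n, int k, int nb,
                 const float *A, int lda,
                 const float *B, int ldb,
                 float *C, int ldc) {
        for (int j = 0; j < n; j += nb) {
            int b = (n - j < nb) ? n - j : nb;   /* last panel may be narrower */
            gemm_ref(m, b, k, A, lda, &B[j*ldb], ldb, &C[j*ldc], ldc);
        }
    }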

The matrix-matrix multiplication. Spark
SPARK: automatic generation of code skeletons.

Section 3: Accelerating the Level-3 CUBLAS

Accelerated Level-3 BLAS routines
SYMM: C := A B + C, where A is symmetric and only the lower triangular part of this matrix is stored.
SYRK: C := C - A A^T, where C is symmetric and only the lower triangular part of this matrix is stored and computed.
SYR2K: C := C - A B^T - B A^T, where C is symmetric and only the upper triangular part of this matrix is stored and computed.
TRMM: C := A B + C, where A is upper triangular.
TRSM: X A^T = B, where A is lower triangular and B is overwritten with the solution X.

Accelerating the CUBLAS
Three main ideas:
1 GEMM-based implementations
2 Multiple algorithmic variants
3 Multiple block sizes

GEMM-based SYRK

    Algorithm: SYRK_GEMM(C, A)
    Partition C -> ( CTL CTR / CBL CBR ), A -> ( AT / AB )
      where CTL is 0 x 0 and AT has 0 rows
    while m(CTL) < m(C) do
      Determine block size b
      Repartition C -> ( C00 C01 C02 / C10 C11 C12 / C20 C21 C22 ),
                  A -> ( A0 / A1 / A2 )
        where C11 is b x b and A1 has b rows
      C11 := C11 - A1 A1^T    (SYRK)
      C21 := C21 - A2 A1^T    (GEMM)
      Continue with the advanced partitionings
    endwhile
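A minimal sketch of this GEMM-based SYRK written against the legacy CUBLAS C API of that period, using cublasSsyrk for the small diagonal block and cublasSgemm for the large off-diagonal panel. The function name, the block-size argument, and the assumption that all matrices already reside in GPU memory in column-major order are ours, not from the slides:

    #include <cublas.h>   /* legacy CUBLAS API (CUBLAS 2.x era) */

    /* C := C - A*A^T, with C (n x n) symmetric, lower triangle stored,
       and A (n x k); both matrices in GPU memory, column-major. */
    void syrk_gemm(int n, int k, int nb,
                   const float *A, int lda, float *C, int ldc) {
        for (int i = 0; i < n; i += nb) {
            int b = (n - i < nb) ? n - i : nb;
            /* C11 := C11 - A1*A1^T: small SYRK on the diagonal block. */
            cublasSsyrk('L', 'N', b, k, -1.0f,
                        &A[i], lda, 1.0f, &C[i + i*ldc], ldc);
            /* C21 := C21 - A2*A1^T: large GEMM below the diagonal,
               where CUBLAS already runs close to its peak. */
            int m2 = n - i - b;
            if (m2 > 0)
                cublasSgemm('N', 'T', m2, b, k, -1.0f,
                            &A[i + b], lda, &A[i], lda,
                            1.0f, &C[i + b + i*ldc], ldc);
        }
    }

Casting the bulk of the work as the C21 update is what makes the approach pay off: almost all flops are performed by GEMM, the one kernel vendors tune most aggressively.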

Multiple algorithmic variants
Variant GEMM_MP (the algorithm shown earlier): traverse B and C by panels of b columns; each iteration computes C1 := C1 + A B1.

Multiple algorithmic variants

    Algorithm: GEMM_PM(A, B, C)
    Partition A -> ( AT / AB ), C -> ( CT / CB )
      where AT has 0 rows and CT has 0 rows
    while m(AT) < m(A) do
      Determine block size b
      Repartition ( AT / AB ) -> ( A0 / A1 / A2 ),
                  ( CT / CB ) -> ( C0 / C1 / C2 )
        where A1 has b rows and C1 has b rows
      C1 := C1 + A1 B
      Continue with the advanced partitionings
    endwhile

Multiple algorithmic variants

    Algorithm: GEMM_PP(A, B, C)
    Partition A -> ( AL | AR ), B -> ( BT / BB )
      where AL has 0 columns and BT has 0 rows
    while n(AL) < n(A) do
      Determine block size b
      Repartition ( AL | AR ) -> ( A0 | A1 | A2 ),
                  ( BT / BB ) -> ( B0 / B1 / B2 )
        where A1 has b columns and B1 has b rows
      C := C + A1 B1
      Continue with the advanced partitionings
    endwhile
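In plain C, and reusing the gemm_ref kernel from the earlier sketch, the remaining two variants differ from GEMM_MP only in which dimension of the problem is traversed (again our own illustration, assuming column-major storage):

    void gemm_ref(int m, int n, int k, const float *A, int lda,
                  const float *B, int ldb, float *C, int ldc);

    /* GEMM_PM: traverse A and C in panels of nb rows; C1 := C1 + A1*B. */
    void gemm_pm(int m, int n, int k, int nb,
                 const float *A, int lda, const float *B, int ldb,
                 float *C, int ldc) {
        for (int i = 0; i < m; i += nb) {
            int b = (m - i < nb) ? m - i : nb;
            gemm_ref(b, n, k, &A[i], lda, B, ldb, &C[i], ldc);
        }
    }

    /* GEMM_PP: traverse the k dimension; each step is a rank-nb
       update of all of C: C := C + A1*B1. */
    void gemm_pp(int m, int n, int k, int nb,
                 const float *A, int lda, const float *B, int ldb,
                 float *C, int ldc) {
        for (int p = 0; p < k; p += nb) {
            int b = (k - p < nb) ? k - p : nb;
            gemm_ref(m, n, b, &A[p*lda], lda, &B[p], ldb, C, ldc);
        }
    }

Which variant is fastest depends on the shape of the operands, which is why keeping several variants and choosing among them is one of the three ideas listed above.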

Varying the block size
The third idea runs the same algorithm (e.g., GEMM_PP above) with different block sizes b; the best value of b is determined empirically for each routine and problem size.
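A simple empirical search is enough to pick the block size. The following is a sketch under our own assumptions: the candidate set, the wall_time() helper, and the gemm_pp sketch above stand in for whatever tuning harness was actually used:

    #include <stddef.h>
    #include <float.h>

    extern double wall_time(void);   /* hypothetical wall-clock timer */

    void gemm_pp(int m, int n, int k, int nb,
                 const float *A, int lda, const float *B, int ldb,
                 float *C, int ldc);

    /* Time each candidate block size on the target problem shape and
       return the fastest one. */
    int tune_block_size(int m, int n, int k,
                        const float *A, int lda, const float *B, int ldb,
                        float *C, int ldc) {
        static const int cand[] = { 32, 64, 128, 256, 512 };
        int best = cand[0];
        double best_t = DBL_MAX;
        for (size_t i = 0; i < sizeof cand / sizeof cand[0]; i++) {
            double t = wall_time();
            gemm_pp(m, n, k, cand[i], A, lda, B, ldb, C, ldc);
            t = wall_time() - t;
            if (t < best_t) { best_t = t; best = cand[i]; }
        }
        return best;
    }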

Section 4: Experimental results

Experimental setup
CPU: Dual Xeon QuadCore E… at … GHz
RAM: 8 Gbytes
GPU: Nvidia Tesla C1060 (GT200) at … GHz
Video memory: 4 Gbytes GDDR3
Interconnect: PCI Express Gen2
Software: CUDA (CUBLAS) version 2.3 (July 2009), driver version …
Results are reported in GFLOPS (single precision); transfer times are not included in the results.

Experimental results
Results for square matrices: SYRK, SYMM and SYR2K.
[Figures: performance (GFLOPS) and speedup of the new SYMM, SYRK and SYR2K against CUBLAS on a T10, as a function of the matrix size (m = n = k)]

Experimental results
Results for square matrices: TRSM and TRMM.
[Figures: performance (GFLOPS) and speedup of the new TRMM, TRSM and GEMM against CUBLAS on a T10, as a function of the matrix size (m = n = k)]

Experimental results
Results for rectangular matrices: TRSM and SYR2K.
[Figures: performance (GFLOPS) and speedup of the new SYR2K (k = 32) and TRSM (m = 32) against CUBLAS on a T10, as a function of the second dimension (n)]

Application: the symmetric eigenvalue problem (PPAM'09)
Reduction from full to tridiagonal form using LAPACK/SBR.
[Figure: GFLOPS vs. matrix rows/columns for SBR on 1 GPU/8 cores with the tuned CUBLAS, SBR on 1 GPU/8 cores with NVIDIA CUBLAS 2.2, SBR on 8 Intel cores with Intel MKL, and LAPACK on 8 Intel cores with Intel MKL]
Using our SYMM and SYR2K implementations:
- Speedup of 2.2x when using GPU acceleration for SBR (19.2 vs. 42 GFLOPS)
- Speedup of 4.3x when using GPU acceleration and the tuned BLAS (19.2 vs. 82 GFLOPS)

Section 5: Conclusions and future work

Conclusions and future work
- Performance boost with little effort for BLAS routines
- Considerable speedups attained compared with the vendor-tuned CUBLAS
- Without writing a single line of CUDA code
- Methodology applicable to other linear algebra routines

31 Conclusions and future work Conclusions and future work Thank you! More information... [Level-3 BLAS on a GPU: Picking the Low Hanging Fruit] FLAME Working Note #37 May, 2009 HPCA Group at University Jaume I ( figual@icc.uji.es Level-3 BLAS on a GPU: Picking the Low Hanging Fruit 30 Igual et al.
