Level-3 BLAS on a GPU
1 Level-3 BLAS on a GPU: Picking the Low Hanging Fruit. Francisco Igual (1), Gregorio Quintana-Ortí (1), Robert A. van de Geijn (2). (1) Departamento de Ingeniería y Ciencia de los Computadores, University Jaume I, Castellón (Spain). (2) Department of Computer Sciences, The University of Texas at Austin. [Slide footer: Level-3 BLAS on a GPU: Picking the Low Hanging Fruit, Igual et al.]
2 Motivation (I). GPU vendors promise spectacular peak performances.
3 Motivation (II). But real performance is not so optimistic. [Figure: CUBLAS 2.3 GEMM performance against peak on a T10, GFLOPS vs. matrix size.] Limitations of the graphics-oriented architecture.
4 Motivation (III). And current implementations can be quite poor. [Figure: performance of CUBLAS 2.3 GEMM, SYR2K and SYMM against peak on a T10, GFLOPS vs. matrix size.] Hard to program the GPU efficiently, even using CUDA.
5 Outline. 1 Introduction. 2 Development of algorithms by blocks. The matrix-matrix product. 3 Accelerating the Level-3 CUBLAS. 4 Experimental results. 5 Conclusions and future work.
6 Contents: Introduction.
7 Introduction. BLAS: Basic Linear Algebra Subprograms. They lie at the heart of complex dense linear algebra algorithms and are key to their final performance. Tuned implementations exist for many architectures: GotoBLAS, MKL, CUBLAS. Goals: tune the performance of the latest implementation of CUBLAS without low-level programming (CUDA), improving programmability via the FLAME methodology.
9 Contents: Development of algorithms by blocks. The matrix-matrix product.
10 The FLAME methodology. FLAME: a high-level abstraction and notation for dense linear algebra algorithms. Not only a library: a notation for expressing algorithms, a methodology for the systematic derivation of algorithms, application program interfaces (APIs) for representing the algorithms in code, tools, and more. Example: matrix-matrix multiplication.
11 The matrix-matrix multiplication. (a) Partitioning before iteration: C is partitioned as (C0 C1 C2) and B as (B0 B1 B2). (b) Computation in iteration: C1 += A B1. (c) Advancement of the partitioning for the next iteration.
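The partitioned loop in (a)-(c) can be written as a plain C loop over column panels. A minimal sketch, not the FLAME API: the function name gemm_mp, the column-major storage and the block size parameter nb are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* C := C + A*B, computed one panel of b columns of B and C at a time,
   mirroring the partitioned loop: in each iteration, C1 += A*B1.
   A is m x k, B is k x n, C is m x n, all in column-major storage. */
void gemm_mp(size_t m, size_t n, size_t k, size_t nb,
             const float *A, const float *B, float *C)
{
    for (size_t j0 = 0; j0 < n; j0 += nb) {        /* advance the L/R boundary */
        size_t b = (n - j0 < nb) ? (n - j0) : nb;  /* block size this iteration */
        for (size_t j = j0; j < j0 + b; j++)       /* C1 := C1 + A*B1 */
            for (size_t p = 0; p < k; p++)
                for (size_t i = 0; i < m; i++)
                    C[i + j*m] += A[i + p*m] * B[p + j*k];
    }
}
```

A convenient sanity check: with A the 2x2 identity and C initially zero, gemm_mp leaves C equal to B.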
12 The matrix-matrix multiplication. FLAME algorithm.

Algorithm: GEMM_MP(A, B, C)
  Partition B -> ( BL | BR ), C -> ( CL | CR )
    where BL has 0 columns, CL has 0 columns
  while n(BL) < n(B) do
    Determine block size b
    Repartition ( BL | BR ) -> ( B0 | B1 | B2 ), ( CL | CR ) -> ( C0 | C1 | C2 )
      where B1 has b columns, C1 has b columns
    C1 := C1 + A B1
    Continue with ( BL | BR ) <- ( B0 B1 | B2 ), ( CL | CR ) <- ( C0 C1 | C2 )
  endwhile
13 The matrix-matrix multiplication. FLAME code.

  FLA_Obj BL, BR, B0, B1, B2;
  FLA_Obj CL, CR, C0, C1, C2;

  FLA_Part_1x2( B,    &BL, &BR,    0, FLA_LEFT );
  FLA_Part_1x2( C,    &CL, &CR,    0, FLA_LEFT );

  while ( FLA_Obj_width( BL ) < FLA_Obj_width( B ) ) {
    b = min( FLA_Obj_width( BR ), nb_alg );

    FLA_Repart_1x2_to_1x3( BL, /**/ BR,    &B0, /**/ &B1, &B2,
                           b, FLA_RIGHT );
    FLA_Repart_1x2_to_1x3( CL, /**/ CR,    &C0, /**/ &C1, &C2,
                           b, FLA_RIGHT );
    /*------------------------------------------------*/
    FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_ONE, A, B1, FLA_ONE, C1 );  /* C1 := C1 + A B1 */
    /*------------------------------------------------*/
    FLA_Cont_with_1x3_to_1x2( &BL, /**/ &BR,    B0, B1, /**/ B2,
                              FLA_LEFT );
    FLA_Cont_with_1x3_to_1x2( &CL, /**/ &CR,    C0, C1, /**/ C2,
                              FLA_LEFT );
  }
14 The matrix-matrix multiplication. Spark. SPARK: automatic generation of code skeletons.
15 Contents: Accelerating the Level-3 CUBLAS.
16 Accelerated Level-3 BLAS routines.
  SYMM:  C := A B + C, with A symmetric and only the lower triangular part of A stored.
  SYRK:  C := C - A A^T, with C symmetric and only the lower triangular part of C stored and computed.
  SYR2K: C := C - A B^T - B A^T, with C symmetric and only the upper triangular part of C stored and computed.
  TRMM:  C := A B + C, with A upper triangular.
  TRSM:  X A^T = B, with A lower triangular and B overwritten by the solution X.
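As an unblocked point of reference for the operations above, the SYRK case (C := C - A A^T, with only the lower triangle of C stored and updated) can be spelled out in plain C. This is a sketch for checking results, not a tuned kernel; the name syrk_ref and the column-major convention are assumptions, not the paper's code.

```c
#include <assert.h>
#include <stddef.h>

/* C := C - A*A^T, with C an n x n symmetric matrix of which only the
   lower triangle (i >= j) is stored and updated; A is n x k.
   Column-major storage: entry (i,j) of C lives at C[i + j*n]. */
void syrk_ref(size_t n, size_t k, const float *A, float *C)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = j; i < n; i++)   /* touch the lower triangle only */
            for (size_t p = 0; p < k; p++)
                C[i + j*n] -= A[i + p*n] * A[j + p*n];
}
```

Note that the strictly upper part of C is never read or written, matching the "only the lower triangular part is stored and computed" convention above.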
17 Accelerating the CUBLAS. Three main ideas: 1 GEMM-based implementations. 2 Multiple algorithmic variants. 3 Multiple block sizes.
18 GEMM-based SYRK.

Algorithm: SYRK_GEMM(C, A)
  Partition C -> ( CTL CTR ; CBL CBR ), A -> ( AT ; AB )
    where CTL is 0 x 0, AT has 0 rows
  while m(CTL) < m(C) do
    Determine block size b
    Repartition
      ( CTL CTR ; CBL CBR ) -> ( C00 C01 C02 ; C10 C11 C12 ; C20 C21 C22 ),
      ( AT ; AB ) -> ( A0 ; A1 ; A2 )
      where C11 is b x b, A1 has b rows
    C11 := C11 - A1 A1^T   (SYRK)
    C21 := C21 - A2 A1^T   (GEMM)
    Continue with
      ( CTL CTR ; CBL CBR ) <- ( C00 C01 C02 ; C10 C11 C12 ; C20 C21 C22 ),
      ( AT ; AB ) <- ( A0 ; A1 ; A2 )
  endwhile
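The loop above can be sketched in plain C: the small b x b diagonal block C11 gets an unblocked SYRK update, while the tall panel C21 below it, which holds almost all the flops, is cast as a GEMM (in the paper's setting, that is the call handed to the tuned CUBLAS GEMM). The helper names gemm_nt and syrk_gemm and the column-major layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* C := C - A*B^T, an m x nn panel updated by a rank-kk GEMM.
   lda/ldb/ldc are the column strides (leading dimensions). */
void gemm_nt(size_t m, size_t nn, size_t kk,
             const float *A, size_t lda,
             const float *B, size_t ldb,
             float *C, size_t ldc)
{
    for (size_t j = 0; j < nn; j++)
        for (size_t i = 0; i < m; i++)
            for (size_t p = 0; p < kk; p++)
                C[i + j*ldc] -= A[i + p*lda] * B[j + p*ldb];
}

/* Blocked C := C - A*A^T on the lower triangle of C (n x n, column-major),
   mirroring SYRK_GEMM: a small SYRK on the diagonal block C11, then one
   large GEMM on the panel C21 below it. A is n x k; nb is the block size. */
void syrk_gemm(size_t n, size_t k, size_t nb, const float *A, float *C)
{
    for (size_t i0 = 0; i0 < n; i0 += nb) {
        size_t b = (n - i0 < nb) ? (n - i0) : nb;
        /* C11 := C11 - A1*A1^T (lower triangle of the b x b diagonal block) */
        for (size_t j = i0; j < i0 + b; j++)
            for (size_t i = j; i < i0 + b; i++)
                for (size_t p = 0; p < k; p++)
                    C[i + j*n] -= A[i + p*n] * A[j + p*n];
        /* C21 := C21 - A2*A1^T: the bulk of the flops, cast as a GEMM */
        gemm_nt(n - (i0 + b), b, k,
                A + (i0 + b), n,      /* A2 */
                A + i0,       n,      /* A1 */
                C + (i0 + b) + i0*n, n);
    }
}
```

The C21 panel shrinks as the loop proceeds, so most of the work happens in early iterations as large, GEMM-friendly panels.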
19 Multiple algorithmic variants.

Algorithm: GEMM_MP(A, B, C)
  Partition B -> ( BL | BR ), C -> ( CL | CR )
    where BL has 0 columns, CL has 0 columns
  while n(BL) < n(B) do
    Determine block size b
    Repartition ( BL | BR ) -> ( B0 | B1 | B2 ), ( CL | CR ) -> ( C0 | C1 | C2 )
      where B1 has b columns, C1 has b columns
    C1 := C1 + A B1
    Continue with ( BL | BR ) <- ( B0 B1 | B2 ), ( CL | CR ) <- ( C0 C1 | C2 )
  endwhile
20 Multiple algorithmic variants.

Algorithm: GEMM_PM(A, B, C)
  Partition A -> ( AT ; AB ), C -> ( CT ; CB )
    where AT has 0 rows, CT has 0 rows
  while m(AT) < m(A) do
    Determine block size b
    Repartition ( AT ; AB ) -> ( A0 ; A1 ; A2 ), ( CT ; CB ) -> ( C0 ; C1 ; C2 )
      where A1 has b rows, C1 has b rows
    C1 := C1 + A1 B
    Continue with ( AT ; AB ) <- ( A0 A1 ; A2 ), ( CT ; CB ) <- ( C0 C1 ; C2 )
  endwhile
21 Multiple algorithmic variants.

Algorithm: GEMM_PP(A, B, C)
  Partition A -> ( AL | AR ), B -> ( BT ; BB )
    where AL has 0 columns, BT has 0 rows
  while n(AL) < n(A) do
    Determine block size b
    Repartition ( AL | AR ) -> ( A0 | A1 | A2 ), ( BT ; BB ) -> ( B0 ; B1 ; B2 )
      where A1 has b columns, B1 has b rows
    C := C + A1 B1
    Continue with ( AL | AR ) <- ( A0 A1 | A2 ), ( BT ; BB ) <- ( B0 B1 ; B2 )
  endwhile
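For contrast with GEMM_MP, the GEMM_PP variant marches through the inner (k) dimension, applying a rank-b update to all of C in each iteration. A plain-C sketch under the same illustrative column-major conventions (the name gemm_pp is an assumption):

```c
#include <assert.h>
#include <stddef.h>

/* C := C + A*B by rank-b updates, mirroring GEMM_PP: in each iteration a
   panel A1 (b columns of A) and a panel B1 (b rows of B) update all of C.
   A is m x k, B is k x n, C is m x n, column-major storage. */
void gemm_pp(size_t m, size_t n, size_t k, size_t nb,
             const float *A, const float *B, float *C)
{
    for (size_t p0 = 0; p0 < k; p0 += nb) {        /* advance the T/B boundary */
        size_t b = (k - p0 < nb) ? (k - p0) : nb;  /* block size this iteration */
        for (size_t j = 0; j < n; j++)             /* C := C + A1*B1 */
            for (size_t p = p0; p < p0 + b; p++)
                for (size_t i = 0; i < m; i++)
                    C[i + j*m] += A[i + p*m] * B[p + j*k];
    }
}
```

All three variants compute the same product; they differ in which operand stays resident across iterations, which is why their performance on the GPU can differ for the same problem size.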
22 Varying the block size. (The GEMM_PP algorithm once more: note the "Determine block size b" step, so a different b may be chosen at each iteration.)
23 Contents: Experimental results.
24 Experimental setup.
  CPU: Dual Xeon QuadCore E..., ... GHz
  RAM memory: 8 GBytes
  GPU: Tesla C1060, Nvidia GT..., ... GHz
  Video memory: 4 GBytes DDR3
  Interconnection: PCIExpress Gen2
CUDA (CUBLAS) version 2.3 (July 2009). Results in terms of GFLOPS (single precision). Transfer times not included in results.
25 Experimental results. Square matrices: SYRK, SYMM and SYR2K. [Figures: performance (GFLOPS) and speedup of the new SYMM, SYRK and SYR2K against CUBLAS on a T10, vs. matrix size (m = n = k).]
26 Experimental results. Square matrices: TRSM and TRMM. [Figures: performance (GFLOPS) and speedup of the new TRMM, TRSM and GEMM against CUBLAS on a T10, vs. matrix size (m = n = k).]
27 Experimental results. Rectangular matrices: TRSM and SYR2K. [Figures: performance (GFLOPS) and speedup of the new SYR2K (k = 32) and TRSM (m = 32) against CUBLAS on a T10, vs. the second dimension n.]
28 Application: symmetric eigenvalue problem (PPAM'09). Reduction from full to tridiagonal form using LAPACK/SBR. [Figure: GFLOPS vs. matrix dimension for SBR on 1 GPU/8 cores with tuned CUBLAS, SBR on 1 GPU/8 cores with NVIDIA CUBLAS 2.2, SBR on 8 Intel cores with Intel MKL, and LAPACK on 8 Intel cores with Intel MKL.] Using our SYMM and SYR2K implementations: a 2.2x speedup when using GPU acceleration for SBR (19.2 vs. 42 GFLOPS), and a 4.3x speedup when using GPU acceleration plus the tuned BLAS (19.2 vs. 82 GFLOPS).
29 Contents: Conclusions and future work.
30 Conclusions and future work. A performance boost with little effort for BLAS routines: considerable speedups over the tuned CUBLAS, without writing a single line of CUDA code. The methodology is applicable to other linear algebra routines.
31 Thank you! More information: "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit", FLAME Working Note #37, May 2009. HPCA Group at University Jaume I. figual@icc.uji.es
More informationAntti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum
More information(Parallel) Algorithms for Two-sided Triangular Solves and Matrix Multiplication
(Parallel lgorithms for Two-sided Triangular Solves and Matrix Multiplication JCK POULSON, ROBERT. VN DE GEIJN, and JEFFREY BENNIGHOF, The University of Texas at ustin We discuss the parallel implementation
More informationA distributed packed storage for large dense parallel in-core calculations
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A
More informationPerformance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank
More informationUtilisation de la compression low-rank pour réduire la complexité du solveur PaStiX
Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX 26 Septembre 2018 - JCAD 2018 - Lyon Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman Outline 1. Context 2.
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationAlgorithms for Reducing a Matrix to Condensed Form
lgorithms for Reducing a Matrix to Condensed Form FLME Working Note #53 Field G. Van Zee Robert. van de Geijn Gregorio Quintana-Ortí G. Joseph Elizondo October 3, 2 bstract In a recent paper it was shown
More informationGPU accelerated Arnoldi solver for small batched matrix
15. 09. 22 GPU accelerated Arnoldi solver for small batched matrix Samsung Advanced Institute of Technology Hyung-Jin Kim Contents - Eigen value problems - Solution - Arnoldi Algorithm - Target - CUDA
More informationSP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain
More informationParallel Solution of Large-Scale and Sparse Generalized algebraic Riccati Equations
Parallel Solution of Large-Scale and Sparse Generalized algebraic Riccati Equations José M. Badía 1, Peter Benner 2, Rafael Mayo 1, and Enrique S. Quintana-Ortí 1 1 Depto. de Ingeniería y Ciencia de Computadores,
More informationc 2016 Society for Industrial and Applied Mathematics
SIAM J SCI COMPUT Vol 38, No 6, pp C748 C781 c 2016 Society for Industrial and Applied Mathematics PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY MARTIN D SCHATZ, ROBERT A VAN DE GEIJN, AND JACK
More informationPerformance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures
Performance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures José I. Aliaga Performance and Energy Analysis of the Iterative Solution of Sparse
More informationSolving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI *
Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * J.M. Badía and A.M. Vidal Dpto. Informática., Univ Jaume I. 07, Castellón, Spain. badia@inf.uji.es Dpto. Sistemas Informáticos y Computación.
More informationJulian Merten. GPU Computing and Alternative Architecture
Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg
More informationEvaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries
Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou (ctheodos@grid.auth.gr) User and Application Support Scientific Computing Centre @ AUTH Presentation Outline
More informationPractical Combustion Kinetics with CUDA
Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton Practical Combustion Kinetics with CUDA GPU Technology Conference March 20, 2015 Russell Whitesides
More informationNotes on Householder QR Factorization
Notes on Householder QR Factorization Robert A van de Geijn Department of Computer Science he University of exas at Austin Austin, X 7872 rvdg@csutexasedu September 2, 24 Motivation A fundamental problem
More informationMAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors
MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:
More informationPerformance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures
Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures Khairul Kabir University of Tennessee kkabir@vols.utk.edu Azzam Haidar
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 212 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationJ.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009
Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures J.I. Aliaga M. Bollhöfer 2 A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ.
More informationThe Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors
Aachen Institute for Advanced Study in Computational Engineering Science Preprint: AICES-2010/09-4 23/September/2010 The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors
More informationParallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)
Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems
More information