Level-3 BLAS on a GPU
1 Level-3 BLAS on a GPU: Picking the Low Hanging Fruit. Francisco Igual (1), Gregorio Quintana-Ortí (1), Robert A. van de Geijn (2). (1) Departamento de Ingeniería y Ciencia de los Computadores, University Jaume I, Castellón (Spain). (2) Department of Computer Sciences, The University of Texas at Austin. [Slide footer: Level-3 BLAS on a GPU: Picking the Low Hanging Fruit, Igual et al.]
2 Motivation (I). GPU vendors promise spectacular peak performances.
3 Motivation (II). But real performance is not so optimistic. [Figure: CUBLAS 2.3 GEMM performance against peak on a T10, GFLOPS vs. matrix size.] Limitations of the graphics-oriented architecture.
4 Motivation (III). And current implementations can be quite poor. [Figure: performance of CUBLAS 2.3 GEMM, SYR2K and SYMM against peak on a T10, GFLOPS vs. matrix size.] Hard to program the GPU efficiently, even using CUDA.
5 Outline. 1 Introduction. 2 Development of algorithms by blocks. The matrix-matrix product. 3 Accelerating the Level-3 CUBLAS. 4 Experimental results. 5 Conclusions and future work.
6 Contents: Introduction.
7 Introduction. BLAS: Basic Linear Algebra Subprograms. They lie at the heart of complex dense linear algebra algorithms and are key to their final performance. Tuned implementations exist for many architectures: GotoBLAS, MKL, CUBLAS. Goals: tune the performance of the latest implementation of CUBLAS without low-level programming (CUDA), improving programmability via the FLAME methodology.
9 Contents: Development of algorithms by blocks. The matrix-matrix product.
10 The FLAME methodology. FLAME: a high-level abstraction and notation for dense linear algebra algorithms. Not only a library: a notation for expressing algorithms, a methodology for the systematic derivation of algorithms, application program interfaces (APIs) for representing the algorithms in code, tools, and more. Example: matrix-matrix multiplication.
11 The matrix-matrix multiplication. (a) Partitioning before iteration: C is partitioned as (C0 C1 C2) and B as (B0 B1 B2). (b) Computation in iteration: C1 += A B1. (c) Advancement of the partitioning for the next iteration.
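The partitioned loop in (a)-(c) can be written as a plain C loop over column panels. A minimal sketch, not the FLAME API: the function name gemm_mp, the column-major storage and the block size parameter nb are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* C := C + A*B, computed one panel of b columns of B and C at a time,
   mirroring the partitioned loop: in each iteration, C1 += A*B1.
   A is m x k, B is k x n, C is m x n, all in column-major storage. */
void gemm_mp(size_t m, size_t n, size_t k, size_t nb,
             const float *A, const float *B, float *C)
{
    for (size_t j0 = 0; j0 < n; j0 += nb) {        /* advance the L/R boundary */
        size_t b = (n - j0 < nb) ? (n - j0) : nb;  /* block size this iteration */
        for (size_t j = j0; j < j0 + b; j++)       /* C1 := C1 + A*B1 */
            for (size_t p = 0; p < k; p++)
                for (size_t i = 0; i < m; i++)
                    C[i + j*m] += A[i + p*m] * B[p + j*k];
    }
}
```

A convenient sanity check: with A the 2x2 identity and C initially zero, gemm_mp leaves C equal to B.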
12 The matrix-matrix multiplication. FLAME algorithm.

Algorithm: GEMM_MP(A, B, C)
  Partition B -> ( BL | BR ), C -> ( CL | CR )
    where BL has 0 columns, CL has 0 columns
  while n(BL) < n(B) do
    Determine block size b
    Repartition ( BL | BR ) -> ( B0 | B1 | B2 ), ( CL | CR ) -> ( C0 | C1 | C2 )
      where B1 has b columns, C1 has b columns
    C1 := C1 + A B1
    Continue with ( BL | BR ) <- ( B0 B1 | B2 ), ( CL | CR ) <- ( C0 C1 | C2 )
  endwhile
13 The matrix-matrix multiplication. FLAME code.

  FLA_Obj BL, BR, B0, B1, B2;
  FLA_Obj CL, CR, C0, C1, C2;

  FLA_Part_1x2( B,    &BL, &BR,    0, FLA_LEFT );
  FLA_Part_1x2( C,    &CL, &CR,    0, FLA_LEFT );

  while ( FLA_Obj_width( BL ) < FLA_Obj_width( B ) ) {
    b = min( FLA_Obj_width( BR ), nb_alg );

    FLA_Repart_1x2_to_1x3( BL, /**/ BR,    &B0, /**/ &B1, &B2,
                           b, FLA_RIGHT );
    FLA_Repart_1x2_to_1x3( CL, /**/ CR,    &C0, /**/ &C1, &C2,
                           b, FLA_RIGHT );
    /*------------------------------------------------*/
    FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
              FLA_ONE, A, B1, FLA_ONE, C1 );  /* C1 := C1 + A B1 */
    /*------------------------------------------------*/
    FLA_Cont_with_1x3_to_1x2( &BL, /**/ &BR,    B0, B1, /**/ B2,
                              FLA_LEFT );
    FLA_Cont_with_1x3_to_1x2( &CL, /**/ &CR,    C0, C1, /**/ C2,
                              FLA_LEFT );
  }
14 The matrix-matrix multiplication. Spark. SPARK: automatic generation of code skeletons.
15 Contents: Accelerating the Level-3 CUBLAS.
16 Accelerated Level-3 BLAS routines.
  SYMM:  C := A B + C, with A symmetric and only the lower triangular part of A stored.
  SYRK:  C := C - A A^T, with C symmetric and only the lower triangular part of C stored and computed.
  SYR2K: C := C - A B^T - B A^T, with C symmetric and only the upper triangular part of C stored and computed.
  TRMM:  C := A B + C, with A upper triangular.
  TRSM:  X A^T = B, with A lower triangular and B overwritten by the solution X.
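As an unblocked point of reference for the operations above, the SYRK case (C := C - A A^T, with only the lower triangle of C stored and updated) can be spelled out in plain C. This is a sketch for checking results, not a tuned kernel; the name syrk_ref and the column-major convention are assumptions, not the paper's code.

```c
#include <assert.h>
#include <stddef.h>

/* C := C - A*A^T, with C an n x n symmetric matrix of which only the
   lower triangle (i >= j) is stored and updated; A is n x k.
   Column-major storage: entry (i,j) of C lives at C[i + j*n]. */
void syrk_ref(size_t n, size_t k, const float *A, float *C)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = j; i < n; i++)   /* touch the lower triangle only */
            for (size_t p = 0; p < k; p++)
                C[i + j*n] -= A[i + p*n] * A[j + p*n];
}
```

Note that the strictly upper part of C is never read or written, matching the "only the lower triangular part is stored and computed" convention above.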
17 Accelerating the CUBLAS. Three main ideas: 1 GEMM-based implementations. 2 Multiple algorithmic variants. 3 Multiple block sizes.
18 GEMM-based SYRK.

Algorithm: SYRK_GEMM(C, A)
  Partition C -> ( CTL CTR ; CBL CBR ), A -> ( AT ; AB )
    where CTL is 0 x 0, AT has 0 rows
  while m(CTL) < m(C) do
    Determine block size b
    Repartition
      ( CTL CTR ; CBL CBR ) -> ( C00 C01 C02 ; C10 C11 C12 ; C20 C21 C22 ),
      ( AT ; AB ) -> ( A0 ; A1 ; A2 )
      where C11 is b x b, A1 has b rows
    C11 := C11 - A1 A1^T   (SYRK)
    C21 := C21 - A2 A1^T   (GEMM)
    Continue with
      ( CTL CTR ; CBL CBR ) <- ( C00 C01 C02 ; C10 C11 C12 ; C20 C21 C22 ),
      ( AT ; AB ) <- ( A0 ; A1 ; A2 )
  endwhile
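The loop above can be sketched in plain C: the small b x b diagonal block C11 gets an unblocked SYRK update, while the tall panel C21 below it, which holds almost all the flops, is cast as a GEMM (in the paper's setting, that is the call handed to the tuned CUBLAS GEMM). The helper names gemm_nt and syrk_gemm and the column-major layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* C := C - A*B^T, an m x nn panel updated by a rank-kk GEMM.
   lda/ldb/ldc are the column strides (leading dimensions). */
void gemm_nt(size_t m, size_t nn, size_t kk,
             const float *A, size_t lda,
             const float *B, size_t ldb,
             float *C, size_t ldc)
{
    for (size_t j = 0; j < nn; j++)
        for (size_t i = 0; i < m; i++)
            for (size_t p = 0; p < kk; p++)
                C[i + j*ldc] -= A[i + p*lda] * B[j + p*ldb];
}

/* Blocked C := C - A*A^T on the lower triangle of C (n x n, column-major),
   mirroring SYRK_GEMM: a small SYRK on the diagonal block C11, then one
   large GEMM on the panel C21 below it. A is n x k; nb is the block size. */
void syrk_gemm(size_t n, size_t k, size_t nb, const float *A, float *C)
{
    for (size_t i0 = 0; i0 < n; i0 += nb) {
        size_t b = (n - i0 < nb) ? (n - i0) : nb;
        /* C11 := C11 - A1*A1^T (lower triangle of the b x b diagonal block) */
        for (size_t j = i0; j < i0 + b; j++)
            for (size_t i = j; i < i0 + b; i++)
                for (size_t p = 0; p < k; p++)
                    C[i + j*n] -= A[i + p*n] * A[j + p*n];
        /* C21 := C21 - A2*A1^T: the bulk of the flops, cast as a GEMM */
        gemm_nt(n - (i0 + b), b, k,
                A + (i0 + b), n,      /* A2 */
                A + i0,       n,      /* A1 */
                C + (i0 + b) + i0*n, n);
    }
}
```

The C21 panel shrinks as the loop proceeds, so most of the work happens in early iterations as large, GEMM-friendly panels.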
19 Multiple algorithmic variants.

Algorithm: GEMM_MP(A, B, C)
  Partition B -> ( BL | BR ), C -> ( CL | CR )
    where BL has 0 columns, CL has 0 columns
  while n(BL) < n(B) do
    Determine block size b
    Repartition ( BL | BR ) -> ( B0 | B1 | B2 ), ( CL | CR ) -> ( C0 | C1 | C2 )
      where B1 has b columns, C1 has b columns
    C1 := C1 + A B1
    Continue with ( BL | BR ) <- ( B0 B1 | B2 ), ( CL | CR ) <- ( C0 C1 | C2 )
  endwhile
20 Multiple algorithmic variants.

Algorithm: GEMM_PM(A, B, C)
  Partition A -> ( AT ; AB ), C -> ( CT ; CB )
    where AT has 0 rows, CT has 0 rows
  while m(AT) < m(A) do
    Determine block size b
    Repartition ( AT ; AB ) -> ( A0 ; A1 ; A2 ), ( CT ; CB ) -> ( C0 ; C1 ; C2 )
      where A1 has b rows, C1 has b rows
    C1 := C1 + A1 B
    Continue with ( AT ; AB ) <- ( A0 A1 ; A2 ), ( CT ; CB ) <- ( C0 C1 ; C2 )
  endwhile
21 Multiple algorithmic variants.

Algorithm: GEMM_PP(A, B, C)
  Partition A -> ( AL | AR ), B -> ( BT ; BB )
    where AL has 0 columns, BT has 0 rows
  while n(AL) < n(A) do
    Determine block size b
    Repartition ( AL | AR ) -> ( A0 | A1 | A2 ), ( BT ; BB ) -> ( B0 ; B1 ; B2 )
      where A1 has b columns, B1 has b rows
    C := C + A1 B1
    Continue with ( AL | AR ) <- ( A0 A1 | A2 ), ( BT ; BB ) <- ( B0 B1 ; B2 )
  endwhile
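For contrast with GEMM_MP, the GEMM_PP variant marches through the inner (k) dimension, applying a rank-b update to all of C in each iteration. A plain-C sketch under the same illustrative column-major conventions (the name gemm_pp is an assumption):

```c
#include <assert.h>
#include <stddef.h>

/* C := C + A*B by rank-b updates, mirroring GEMM_PP: in each iteration a
   panel A1 (b columns of A) and a panel B1 (b rows of B) update all of C.
   A is m x k, B is k x n, C is m x n, column-major storage. */
void gemm_pp(size_t m, size_t n, size_t k, size_t nb,
             const float *A, const float *B, float *C)
{
    for (size_t p0 = 0; p0 < k; p0 += nb) {        /* advance the T/B boundary */
        size_t b = (k - p0 < nb) ? (k - p0) : nb;  /* block size this iteration */
        for (size_t j = 0; j < n; j++)             /* C := C + A1*B1 */
            for (size_t p = p0; p < p0 + b; p++)
                for (size_t i = 0; i < m; i++)
                    C[i + j*m] += A[i + p*m] * B[p + j*k];
    }
}
```

All three variants compute the same product; they differ in which operand stays resident across iterations, which is why their performance on the GPU can differ for the same problem size.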
22 Varying the block size. (The GEMM_PP algorithm once more: note the "Determine block size b" step, so a different b may be chosen at each iteration.)
23 Contents: Experimental results.
24 Experimental setup.
  CPU: Dual Xeon QuadCore E..., ... GHz
  RAM memory: 8 GBytes
  GPU: Tesla C1060, Nvidia GT..., ... GHz
  Video memory: 4 GBytes DDR3
  Interconnection: PCIExpress Gen2
CUDA (CUBLAS) version 2.3 (July 2009). Results in terms of GFLOPS (single precision). Transfer times not included in results.
25 Experimental results. Square matrices: SYRK, SYMM and SYR2K. [Figures: performance (GFLOPS) and speedup of the new SYMM, SYRK and SYR2K against CUBLAS on a T10, vs. matrix size (m = n = k).]
26 Experimental results. Square matrices: TRSM and TRMM. [Figures: performance (GFLOPS) and speedup of the new TRMM, TRSM and GEMM against CUBLAS on a T10, vs. matrix size (m = n = k).]
27 Experimental results. Rectangular matrices: TRSM and SYR2K. [Figures: performance (GFLOPS) and speedup of the new SYR2K (k = 32) and TRSM (m = 32) against CUBLAS on a T10, vs. the second dimension n.]
28 Application: symmetric eigenvalue problem (PPAM'09). Reduction from full to tridiagonal form using LAPACK/SBR. [Figure: GFLOPS vs. matrix dimension for SBR on 1 GPU/8 cores with tuned CUBLAS, SBR on 1 GPU/8 cores with NVIDIA CUBLAS 2.2, SBR on 8 Intel cores with Intel MKL, and LAPACK on 8 Intel cores with Intel MKL.] Using our SYMM and SYR2K implementations: a 2.2x speedup when using GPU acceleration for SBR (19.2 vs. 42 GFLOPS), and a 4.3x speedup when using GPU acceleration plus the tuned BLAS (19.2 vs. 82 GFLOPS).
29 Contents: Conclusions and future work.
30 Conclusions and future work. A performance boost with little effort for BLAS routines: considerable speedups over the tuned CUBLAS, without writing a single line of CUDA code. The methodology is applicable to other linear algebra routines.
31 Thank you! More information: "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit", FLAME Working Note #37, May 2009. HPCA Group at University Jaume I. figual@icc.uji.es
More informationAntti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum
More information(Parallel) Algorithms for Two-sided Triangular Solves and Matrix Multiplication
(Parallel lgorithms for Two-sided Triangular Solves and Matrix Multiplication JCK POULSON, ROBERT. VN DE GEIJN, and JEFFREY BENNIGHOF, The University of Texas at ustin We discuss the parallel implementation
More informationA distributed packed storage for large dense parallel in-core calculations
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A
More informationPerformance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank
More informationUtilisation de la compression low-rank pour réduire la complexité du solveur PaStiX
Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX 26 Septembre 2018 - JCAD 2018 - Lyon Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman Outline 1. Context 2.
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationAlgorithms for Reducing a Matrix to Condensed Form
lgorithms for Reducing a Matrix to Condensed Form FLME Working Note #53 Field G. Van Zee Robert. van de Geijn Gregorio Quintana-Ortí G. Joseph Elizondo October 3, 2 bstract In a recent paper it was shown
More informationGPU accelerated Arnoldi solver for small batched matrix
15. 09. 22 GPU accelerated Arnoldi solver for small batched matrix Samsung Advanced Institute of Technology Hyung-Jin Kim Contents - Eigen value problems - Solution - Arnoldi Algorithm - Target - CUDA
More informationSP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain
More informationParallel Solution of Large-Scale and Sparse Generalized algebraic Riccati Equations
Parallel Solution of Large-Scale and Sparse Generalized algebraic Riccati Equations José M. Badía 1, Peter Benner 2, Rafael Mayo 1, and Enrique S. Quintana-Ortí 1 1 Depto. de Ingeniería y Ciencia de Computadores,
More informationc 2016 Society for Industrial and Applied Mathematics
SIAM J SCI COMPUT Vol 38, No 6, pp C748 C781 c 2016 Society for Industrial and Applied Mathematics PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY MARTIN D SCHATZ, ROBERT A VAN DE GEIJN, AND JACK
More informationPerformance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures
Performance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures José I. Aliaga Performance and Energy Analysis of the Iterative Solution of Sparse
More informationSolving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI *
Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * J.M. Badía and A.M. Vidal Dpto. Informática., Univ Jaume I. 07, Castellón, Spain. badia@inf.uji.es Dpto. Sistemas Informáticos y Computación.
More informationJulian Merten. GPU Computing and Alternative Architecture
Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg
More informationEvaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries
Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou (ctheodos@grid.auth.gr) User and Application Support Scientific Computing Centre @ AUTH Presentation Outline
More informationPractical Combustion Kinetics with CUDA
Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton Practical Combustion Kinetics with CUDA GPU Technology Conference March 20, 2015 Russell Whitesides
More informationNotes on Householder QR Factorization
Notes on Householder QR Factorization Robert A van de Geijn Department of Computer Science he University of exas at Austin Austin, X 7872 rvdg@csutexasedu September 2, 24 Motivation A fundamental problem
More informationMAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors
MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:
More informationPerformance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures
Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures Khairul Kabir University of Tennessee kkabir@vols.utk.edu Azzam Haidar
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 212 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationJ.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009
Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures J.I. Aliaga M. Bollhöfer 2 A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ.
More informationThe Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors
Aachen Institute for Advanced Study in Computational Engineering Science Preprint: AICES-2010/09-4 23/September/2010 The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors
More informationParallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)
Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems
More information