Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries

1 Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou User and Application Support Scientific Computing AUTH

2 Presentation Outline
Problem Description
Serial Implementation
Parallel Implementation
Numerical Results
Conclusions

3 Problem Description

4 Linear Algebra
Linear System Solution (LAPACK/SCALAPACK): Ax=b
Matrix-Matrix Multiplication (BLAS/PBLAS): Ax-b=0

5 Linear Algebra Libraries
Serial Implementation
  BLAS (Basic Linear Algebra Subprograms)
  LAPACK (Linear Algebra PACKage)
Parallel Implementation
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS (Parallel BLAS)
  SCALAPACK (Scalable LAPACK)

6 Example Case

7 Example Case
[Figure labels: N, M]

8 Example Case
[Figure labels: N, N RHS, M]

9 Serial Implementation (Ax=b)
DGESV: Computes the solution to a real system of linear equations A*X=B.
DGESV( N, NRHS, A, LDA, IPIV, B, LDB, INFO )
N: The number of linear equations.
NRHS: The number of right-hand sides.
A: On entry, the N-by-N matrix A. On exit, the factors L and U of A.
LDA: The leading dimension of the array A.
IPIV: The pivot indices that define the permutation matrix P.
B: On entry, the N-by-NRHS right-hand side matrix B. On exit, the N-by-NRHS solution matrix X.
LDB: The leading dimension of the array B.
INFO: Zero if the solve was successful. A positive value i means U(i,i) is exactly zero, so no solution was computed.
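To make the DGESV arguments concrete, here is a minimal pure-Python sketch of the same approach: LU factorization with partial pivoting, followed by back substitution. It is not LAPACK code. The name `gesv_sketch` and its return convention are assumptions made for illustration. Unlike DGESV, it copies its inputs instead of overwriting A and B, and it stops at the first zero pivot.

```python
def gesv_sketch(a, b):
    """Solve A*X = B by Gaussian elimination with partial pivoting.

    a: N-by-N matrix as a list of rows; b: N-by-NRHS matrix as a list of rows.
    Returns (lu, ipiv, x, info), loosely mirroring DGESV's outputs:
    lu holds L (below the diagonal, unit diagonal implied) and U,
    ipiv holds 1-based pivot rows, x is the solution (None on failure),
    info is 0 on success or the 1-based index of a zero pivot.
    """
    n = len(a)
    nrhs = len(b[0])
    lu = [row[:] for row in a]   # hypothetical choice: work on copies
    x = [row[:] for row in b]
    ipiv = [0] * n

    for k in range(n):
        # Choose the row with the largest entry in column k as pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        ipiv[k] = p + 1
        if lu[p][k] == 0.0:
            return lu, ipiv, None, k + 1   # singular: stop early
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            x[k], x[p] = x[p], x[k]
        # Eliminate column k below the diagonal, storing the multipliers.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
            for j in range(nrhs):
                x[i][j] -= m * x[k][j]

    # Back substitution with the upper triangular factor U.
    for j in range(nrhs):
        for i in range(n - 1, -1, -1):
            s = x[i][j] - sum(lu[i][c] * x[c][j] for c in range(i + 1, n))
            x[i][j] = s / lu[i][i]
    return lu, ipiv, x, 0
```

Calling `gesv_sketch([[2.0, 1.0], [4.0, 3.0]], [[3.0], [7.0]])` swaps the two rows at the first step, so `ipiv` records row 2 as the first pivot, and `info` comes back as zero.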

10 Serial Implementation (Ax-b=0)
DGEMM: Performs one of the matrix-matrix operations C = alpha*op(A)*op(B) + beta*C, where op(X) is X or its transpose.
DGEMM( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
TRANSA: Specifies whether A or its transpose is used.
TRANSB: Specifies whether B or its transpose is used.
M: The number of rows of op(A) and of C.
N: The number of columns of op(B) and of C.
K: The number of columns of op(A) and rows of op(B).
ALPHA: The scalar alpha.
A: The M-by-K matrix A (K-by-M if transposed).
LDA: The leading dimension of the array A.
B: The K-by-N matrix B (N-by-K if transposed).
LDB: The leading dimension of the array B.
BETA: The scalar beta.
C: The M-by-N matrix C.
LDC: The leading dimension of the array C.
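The DGEMM operation can be written out in a few lines of plain Python, which also shows how the residual Ax-b follows from it with alpha = 1 and beta = -1. This is only a reference sketch of the arithmetic, not the BLAS routine. The name `gemm_sketch` is hypothetical, the dimensions M, N, K are inferred from the inputs instead of being passed, and the result is returned as a new matrix where DGEMM overwrites C.

```python
def gemm_sketch(transa, transb, alpha, a, b, beta, c):
    """Return alpha*op(A)*op(B) + beta*C as a new list of rows.

    transa / transb: 'N' to use the matrix as given, 'T' to use its transpose.
    Matrices are lists of rows; dimensions are taken from the inputs.
    """
    def op(mat, trans):
        if trans.upper() == "T":
            return [list(col) for col in zip(*mat)]
        return mat

    oa, ob = op(a, transa), op(b, transb)
    m, k, n = len(oa), len(ob), len(ob[0])
    return [
        [alpha * sum(oa[i][p] * ob[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


# Residual check: r = 1*A*x + (-1)*b, which should vanish when A*x = b.
A = [[2.0, 1.0], [4.0, 3.0]]
x = [[1.0], [1.0]]
b = [[3.0], [7.0]]
residual = gemm_sketch("N", "N", 1.0, A, x, -1.0, b)
```

In this small case every entry of `residual` is exactly zero, since x solves the system.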

11 Example Case (Parallel Implementation)

12 Example Case (Parallel Implementation)
[Figure labels: N, N RHS, M]

13 Example Case (Parallel Implementation)
2 x 2 = 4 CPUs
[Figure labels: N, N RHS, M]

14 Example Case (Parallel Implementation)
3 x 2 = 6 CPUs
[Figure labels: N, N RHS, M]
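ScaLAPACK spreads a matrix over the process grid in a two-dimensional block-cyclic layout. The sketch below shows which process would own each entry for grids like the 2 x 2 and 3 x 2 cases above. The function names and the 2-by-2 block size are assumptions chosen for illustration, indices are 0-based, and the first block is assumed to sit on process (0, 0).

```python
def block_cyclic_owner(i, j, mb, nb, nprow, npcol):
    """Return (process row, process column) owning global entry (i, j).

    mb, nb: block sizes in rows and columns; nprow x npcol: process grid.
    Blocks are dealt out cyclically along each dimension.
    """
    return ((i // mb) % nprow, (j // nb) % npcol)


def ownership_map(m, n, mb, nb, nprow, npcol):
    """Owner of every entry of an m-by-n matrix, as a list of rows."""
    return [
        [block_cyclic_owner(i, j, mb, nb, nprow, npcol) for j in range(n)]
        for i in range(m)
    ]


if __name__ == "__main__":
    # 6-by-6 matrix, 2-by-2 blocks, 2 x 2 = 4 processes.
    for row in ownership_map(6, 6, 2, 2, 2, 2):
        print(" ".join(f"{pr}{pc}" for pr, pc in row))
```

Because blocks wrap around the grid, each process holds pieces from all over the matrix, which helps keep the work balanced as a factorization shrinks the active part.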

15 Parallel Implementation (Ax=b)
PDGESV: Computes the solution to a real system of linear equations A*X=B.
DGESV ( N, NRHS, A, LDA, IPIV, B, LDB, INFO )
PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )
IA: The row index in the global array A of the first row of the submatrix to operate on.
JA: The column index in the global array A of the first column of the submatrix to operate on.
DESCA: The array descriptor for the distributed matrix A.
IB: The row index in the global array B of the first row of the submatrix to operate on.
JB: The column index in the global array B of the first column of the submatrix to operate on.
DESCB: The array descriptor for the distributed matrix B.
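The descriptors DESCA and DESCB replace the leading dimensions of the serial call. A dense-matrix descriptor is a 9-integer array holding DTYPE, CTXT, M, N, MB, NB, RSRC, CSRC and LLD, and in ScaLAPACK it is normally filled in by DESCINIT. The Python sketch below mimics that layout and ports the logic of the NUMROC tool routine, which counts how many rows or columns land on a given process. The helper name `make_descriptor` and the context value 0 are placeholders, not real BLACS handles.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows or columns of a block-cyclic dimension held locally.

    n: global size; nb: block size; iproc: this process's coordinate;
    isrcproc: coordinate owning the first block; nprocs: grid size.
    """
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of full blocks
    count = (nblocks // nprocs) * nb       # full rounds every process gets
    extrablks = nblocks % nprocs           # leftover full blocks
    if mydist < extrablks:
        count += nb                        # one extra full block
    elif mydist == extrablks:
        count += n % nb                    # the trailing partial block
    return count


def make_descriptor(m, n, mb, nb, myrow, nprow, ictxt=0, rsrc=0, csrc=0):
    """Build a 9-entry descriptor list in ScaLAPACK's field order."""
    lld = max(1, numroc(m, mb, myrow, rsrc, nprow))  # local leading dimension
    return [1, ictxt, m, n, mb, nb, rsrc, csrc, lld]  # DTYPE 1 = dense matrix
```

For example, `numroc(10, 2, 0, 0, 2)` gives 6 and `numroc(10, 2, 1, 0, 2)` gives 4: five blocks of two rows are dealt alternately to two process rows.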

16 Parallel Implementation (Ax-b=0)
PDGEMM: Performs one of the matrix-matrix operations C = alpha*op(A)*op(B) + beta*C on distributed matrices, where op(X) is X or its transpose.
DGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
PDGEMM( TRANSA, TRANSB, M, N, K, ALPHA, A, IA, JA, DESCA, B, IB, JB, DESCB, BETA, C, IC, JC, DESCC )
IA: The row index in the global array A.
JA: The column index in the global array A.
DESCA: The array descriptor for the distributed matrix A.
IB: The row index in the global array B.
JB: The column index in the global array B.
DESCB: The array descriptor for the distributed matrix B.
IC: The row index in the global array C.
JC: The column index in the global array C.
DESCC: The array descriptor for the distributed matrix C.

17 Serial Implementation
Standard BLAS
Goto BLAS
ATLAS BLAS
ACML (AMD Core Math Library)
Intel MKL

18 Serial Implementation Results (Ax-b=0)
[Plot: time versus problem size for blas, goto 1p and goto 8p; time axis from 00:00 to 00:35]
* Intel Xeon 2.33GHz

19 Serial Implementation Results (Ax=b)
[Plot: time versus problem size for blas, goto 1p and goto 8p; time axis from 00:00 to 14:00]
* Intel Xeon 2.33GHz

20 Parallel Implementation Results (Ax-b=0)
[Plot: time versus problem size for blas, goto 1p, goto 8p, scalapack 1x8 and scalapack 8x8; time axis from 00:00 to 00:43]
* Intel Xeon 2.33GHz

21 Parallel Implementation Results (Ax=b)
[Plot: time versus problem size for blas, goto 1p, goto 8p, scalapack 1x8 and scalapack 8x8; time axis from 00:00 to 14:00, with a second time axis up to 01:00 for scalapack 1x8 and scalapack 8x8]
* Intel Xeon 2.33GHz

22 Conclusions
Optimized linear algebra libraries improve performance.
Scaling improves as the problem size grows.
Distributed-memory libraries can handle larger problems than shared-memory libraries.
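The last point can be illustrated with a rough storage estimate: a dense N-by-N double-precision matrix needs 8*N*N bytes, and a distributed-memory library splits that across processes. The sketch below assumes a perfectly even split and ignores workspace and the right-hand sides. The function name is hypothetical and the sizes in the usage lines are arbitrary examples, not figures from the benchmarks.

```python
def dense_matrix_bytes(n, nprocs=1, bytes_per_entry=8):
    """Approximate bytes per process for a dense n-by-n matrix.

    Assumes an even split over nprocs processes (an idealization of the
    block-cyclic layout) and 8-byte double-precision entries by default.
    """
    return bytes_per_entry * n * n / nprocs


# One process versus an 8 x 8 = 64 process grid, for an arbitrary size.
single = dense_matrix_bytes(1000)
per_process = dense_matrix_bytes(1000, nprocs=64)
```

Since the per-process footprint falls in proportion to the number of processes, a matrix too large for one node's memory may still fit once it is distributed.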
