Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems
|
|
- Cornelius Howard
- 5 years ago
- Views:
Transcription
1 Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems Emmanuel AGULLO (INRIA HiePACS team) Julien LANGOU (University of Colorado Denver) Joint work with University of Tennessee and INRIA Runtime Vers la simulation numérique pétaflopique sur architectures parallèles hybrides École CEA-EDF-INRIA, Sophia, France, June 6-10, 2011 AGULLO - LANGOU QR Factorization Over Runtime Systems 1
2 Motivation Introduction Algorithm / Software design Must rethink the design of our software. J. Dongarra, yesterday s talk. QR factorization One algorithmic idea in numerical linear algebra is more important than all the others: QR factorization. L. N. Trefethen and D. Bau, Numerical Linear Algebra, SIAM, AGULLO - LANGOU QR Factorization Over Runtime Systems 2
3 Motivation Introduction Algorithm / Software design Must rethink the design of our software. J. Dongarra, yesterday s talk. QR factorization One algorithmic idea in numerical linear algebra is more important than all the others: QR factorization. L. N. Trefethen and D. Bau, Numerical Linear Algebra, SIAM, AGULLO - LANGOU QR Factorization Over Runtime Systems 2
4 Introduction Three-layers paradigm High-level algorithm Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 3
5 Introduction Three-layers paradigm High-level algorithm Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 3
6 Introduction Three-layers paradigm High-level algorithm Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 3
7 Introduction Three-layers paradigm High-level algorithm Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 3
8 Introduction Three-layers paradigm High-level algorithm Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 3
9 Objective of the talk Introduction Show that this 3-layers programmation paradigm: increases productivity; enables performance portability. Knowing everything you have ever wondered about QR factorization. AGULLO - LANGOU QR Factorization Over Runtime Systems 4
10 Objective of the talk Introduction Show that this 3-layers programmation paradigm: increases productivity; enables performance portability. Knowing everything you have ever wondered about QR factorization. AGULLO - LANGOU QR Factorization Over Runtime Systems 4
11 Outline Introduction 1. QR factorization 2. Runtime System for multicore architectures 3. Using Accelerators 4. Enhancing parallelism 5. Distributed Memory AGULLO - LANGOU QR Factorization Over Runtime Systems 5
12 Outline QR factorization 1. QR factorization 2. Runtime System for multicore architectures 3. Using Accelerators 4. Enhancing parallelism 5. Distributed Memory AGULLO - LANGOU QR Factorization Over Runtime Systems 6
13 Three-layers paradigm High-level algorithm QR factorization Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 7
14 Motivation for tile algorithms Need to design new algorithm in order to reduce communication enhance parallelism AGULLO - LANGOU QR Factorization Over Runtime Systems 8
15 1D tile QR - binary tree Reduce Algorithms: Introduc4on The QR factoriza4on of a long and skinny matrix with its data par44oned ver4cally across several processors arises in a wide range of applica4ons. Input: A is block distributed by rows Output: Q is block distributed by rows R is global R A 1 Q 1 A 2 Q 2 A 3 Q 3 AGULLO - LANGOU QR Factorization Over Runtime Systems 9
16 1D tile QR - binary tree Reduce Algorithms: Introduc4on Example of applica3ons: a) in linear least squares problems which the number of equa4ons is extremely larger than the number of unknowns b) in block itera4ve methods (itera4ve methods with mul4ple right hand sides or itera4ve eigenvalue solvers) c) in dense large and more square QR factoriza4on where they are used as the panel factoriza4on step AGULLO - LANGOU QR Factorization Over Runtime Systems 10
17 1D tile QR - binary tree Reduce Algorithms: Introduc4on Example of applica3ons: a) in block itera4ve methods (itera4ve methods with mul4ple right hand sides or itera4ve eigenvalue solvers), b) in dense large and more square QR factoriza4on where they are used as the panel factoriza4on step, or more simply c) in linear least squares problems which the number of equa4ons is extremely larger than the number of unknowns. The main characteris3cs of those three examples are that a) there is only one column of processors involved but several processor rows, b) all the data is known from the beginning, c) and the matrix is dense. Various methods already exist to perform the QR factoriza4on of such matrices: a) Gram Schmidt (mgs(row),cgs), b) Householder (qr2, qrf), c) or CholeskyQR. We present a new method: Allreduce Householder (rhh_qr3, rhh_qrf). AGULLO - LANGOU QR Factorization Over Runtime Systems 11
18 1D tile QR - binary tree The CholeskyQR Algorithm SYRK: C:= A T A ( mn 2 ) C i A i T A i CHOL: R := chol( C ) ( n 3 /3 ) TRSM: Q := A/R ( mn 2 ) R chol ( ) C / R Q A i i AGULLO - LANGOU QR Factorization Over Runtime Systems 12
19 1D tile QR - binary tree Bibligraphy A. Stathopoulos and K. Wu, A block orthogonaliza4on procedure with constant synchroniza4on requirements, SIAM Journal on Scien0fic Compu0ng, 23(6): , Popularized by itera4ve eigensolver libraries: 1) PETSc (Argonne Na4onal Lab.) through BLOPEX (A. Knyazev, UCDHSC), 2) HYPRE (Lawrence Livermore Na4onal Lab.) through BLOPEX, 3) Trilinos (Sandia Na4onal Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk), 4) PRIMME (A. Stathopoulos, Coll. William & Mary ). AGULLO - LANGOU QR Factorization Over Runtime Systems 13
20 1D tile QR - binary tree Parallel distributed CholeskyQR The CholeskyQR method in the parallel distributed context can be described as follows: 1: SYRK: C:= A T A ( mn 2 ) 1. C i A i T A i 2: MPI_Reduce: C:= sum procs C (on proc 0) 3: CHOL: R := chol( C ) ( n 3 /3 ) 4: MPI_Bdcast Broadcast the R factor on proc 0 to all the other processors 5: TRSM: Q := A/R ( mn 2 ) 2. C C C 2 C 3 C R chol ( ) C 5. / R Q A i i AGULLO - LANGOU QR Factorization Over Runtime Systems 14
21 1D tile QR - binary tree In this experiment, we fix the problem: m=100,000 and n=50. Efficient enough? Time (sec) Performance (MFLOP/sec/proc) # of procs # of procs # of cholqr cgs mgs(row) qrf mgs procs (1.02) (3.73) 73.5 (6.81) 39.1 (12.78) (8.90) (0.54) 78.9 (3.17) 39.0 (6.41) 22.3 (11.21) (8.01) (0.27) 71.3 (1.75) 38.7 (3.23) 22.2 (5.63) (4.23) (0.14) 67.4 (0.93) 36.7 (1.70) 20.8 (3.01) (2.96) (0.09) 54.2 (0.58) 31.6 (0.99) 18.3 (1.71) (2.16) (0.08) 41.9 (0.37) 29.0 (0.54) 15.8 (0.99) 8.38 (1.87) MFLOP/sec/proc Time in sec AGULLO - LANGOU QR Factorization Over Runtime Systems 15
22 1D tile QR - binary tree Simple enough? 1. C i A T i A i 2. C C 1 + C 2 + C 3 + C chol ( C ) 5. \ R Q i A i int choleskyqr_a_v0(int mloc, int n, double *A, int lda, double *R, int ldr, MPI_Comm mpi_comm){ } int info; cblas_dsyrk( CblasColMajor, CblasUpper, CblasTrans, n, mloc, 1.0e+00, A, lda, 0e+00, R, ldr ); MPI_Allreduce( MPI_IN_PLACE, R, n*n, MPI_DOUBLE, MPI_SUM, mpi_comm ); lapack_dpotrf( lapack_upper, n, R, ldr, &info ); cblas_dtrsm( CblasColMajor, CblasRight, CblasUpper, CblasNoTrans, CblasNonUnit, mloc, n, 1.0e+00, R, ldr, A, lda ); return 0; ( and, OK, you might want to add an MPI user defined datatype to send only the upper part of R) AGULLO - LANGOU QR Factorization Over Runtime Systems 16
23 1D tile QR - binary tree A Q R 2 / A 2 I Q T Q E E E E E E E E E E E E 16 Stable enough? 1.00E E E E E E E E+14 κ 2 (A) 1.00E E E E E E E E+14 κ 2 (A) m=100, n=50 cholqr cgs mgs Householder cholqr cgs mgs Householder AGULLO - LANGOU QR Factorization Over Runtime Systems 17
24 1D tile QR - binary tree Parallel distributed CholeskyQR The CholeskyQR method in the parallel distributed context can be described as follows: 1: SYRK: C:= A T A ( mn 2 ) 1. C i A i T A i 2: MPI_Reduce: C:= sum procs C (on proc 0) 3: CHOL: R := chol( C ) ( n 3 /3 ) 4: MPI_Bdcast Broadcast the R factor on proc 0 to all the other processors 5: TRSM: Q := A/R ( mn 2 ) 2. C C C 2 C 3 C R chol ( ) C 5. / R Q A i i This method is extremely fast. For two reasons: 1. first, there is only one or two communica3ons phase, 2. second, the local computa3ons are performed with fast opera3ons. Another advantage of this method is that the resul4ng code is exactly four lines, 3. so the method is simple and relies heavily on other libraries. Despite all those advantages, 4. this method is highly unstable. AGULLO - LANGOU QR Factorization Over Runtime Systems 18
25 1D tile QR - binary tree On two processes A 0 Q 0 processes A 1 Q 1 4me AGULLO - LANGOU QR Factorization Over Runtime Systems 19
26 1D tile QR - binary tree On two processes QR ( ) R 0 (0) (, ) A 0 V 0 (0) processes QR ( ) R 1 (0) (, ) A 1 V 1 (0) 4me AGULLO - LANGOU QR Factorization Over Runtime Systems 20
27 1D tile QR - binary tree On two processes QR ( ) R 0 (0) (, ) R (0) 0 ( ) R (0) 1 A 0 V 0 (0) processes QR ( ) R 1 (0) (, ) A 1 V 1 (0) 4me AGULLO - LANGOU QR Factorization Over Runtime Systems 21
28 1D tile QR - binary tree On two processes QR ( ) R 0 (0) (, ) QR ( R (0) 0 ) R (0) 1 R (1) ( V (1) 0, 0 ) V (1) 1 A 0 V 0 (0) processes QR ( ) R 1 (0) (, ) A 1 V 1 (0) 4me AGULLO - LANGOU QR Factorization Over Runtime Systems 22
29 1D tile QR - binary tree On two processes QR ( ) R 0 (0) (, ) QR ( R (0) 0 ) R (0) 1 R (1) ( V (1) 0, 0 ) V (1) 1 A 0 V 0 (0) Apply ( V (1) 0 to I n ) V (1) 1 0 n Q 0 (1) Q 1 (1) processes QR ( ) R 1 (0) (, ) A 1 V 1 (0) 4me AGULLO - LANGOU QR Factorization Over Runtime Systems 23
30 1D tile QR - binary tree On two processes QR ( ) R 0 (0) (, ) QR ( R (0) 0 ) R (0) 1 R (1) ( V (1) 0, 0 ) V (1) 1 Q 0 (1) A 0 V 0 (0) Apply ( V (1) 0 to I n ) V (1) 1 0 n Q 0 (1) Q 1 (1) processes QR ( ) R 1 (0) (, ) Q 1 (1) A 1 V 1 (0) 4me AGULLO - LANGOU QR Factorization Over Runtime Systems 24
31 1D tile QR - binary tree On two processes QR ( ) R 0 (0) (, ) QR ( R (0) 0 ) R (0) 1 R (1) ( V (1) 0, 0 ) V (1) 1 Q (1) Apply ( to 0 ) 0 n A 0 V 0 (0) Apply ( V (1) 0 to I n ) V (1) 1 0 n Q 0 (1) Q 1 (1) V 0 (0) Q 0 processes QR ( ) R 1 (0) (, ) Apply ( to Q (1) 1 ) 0 n A 1 V 1 (0) V 1 (0) Q 1 4me AGULLO - LANGOU QR Factorization Over Runtime Systems 25
32 1D tile QR - binary tree The big picture. A 0 Q 0 R A 1 Q 1 R processes A 2 A 3 R Q 2 R Q 3 A 4 Q 4 R A 5 Q 5 R A 6 Q 6 R A 4me QR AGULLO - LANGOU QR Factorization Over Runtime Systems 26
33 1D tile QR - binary tree The big picture. A 0 A 1 processes A 2 A 3 A 4 A 5 A 6 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 27
34 1D tile QR - binary tree The big picture. A 0 A 1 processes A 2 A 3 A 4 A 5 A 6 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 28
35 1D tile QR - binary tree The big picture. A 0 A 1 processes A 2 A 3 A 4 A 5 A 6 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 29
36 1D tile QR - binary tree The big picture. A 0 A 1 processes A 2 A 3 A 4 A 5 A 6 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 30
37 1D tile QR - binary tree The big picture. A 0 A 1 processes A 2 A 3 A 4 A 5 A 6 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 31
38 1D tile QR - binary tree The big picture. A 0 A 1 processes A 2 A 3 A 4 A 5 A 6 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 32
39 1D tile QR - binary tree The big picture. A 0 A 1 processes A 2 A 3 A 4 A 5 A 6 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 33
40 1D tile QR - binary tree The big picture. A 0 A 1 processes A 2 A 3 A 4 A 5 A 6 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 34
41 1D tile QR - binary tree The big picture. A 0 Q 0 R A 1 Q 1 R processes A 2 A 3 R Q 2 R Q 3 A 4 Q 4 R A 5 Q 5 R A 6 Q 6 R 4me computa3on communica3on AGULLO - LANGOU QR Factorization Over Runtime Systems 35
42 1D tile QR - binary tree Latency but also possibility of fast panel factoriza4on. DGEQR3 is the recursive algorithm (see Elmroth and Gustavson, 2000), DGEQRF and DGEQR2 are the LAPACK rou4nes. Times include QR and DLARFT. Run on Pen4um III. QR factoriza3on and construc3on of T m = 10,000 Perf in MFLOP/sec (Times in sec) n DGEQR3 DGEQRF DGEQR (0.29) 65.0 (0.77) 64.6 (0.77) (0.83) 62.6 (3.17) 65.3 (3.04) (1.60) 81.6 (5.46) 64.2 (6.94) (2.53) (7.09) 65.9 (11.98) MFLOP/sec m=1000,000, the x axis is n AGULLO - LANGOU QR Factorization Over Runtime Systems 36
43 1D tile QR - binary tree QR ( ) R 0 (0) ( ) When only R is wanted R (0) 0 QR ( R (0) 1 ) ( R (1) 0 ) R (1) 0 QR ( R (1) 2 ) ( R ) A 0 QR ( ) R 1 (0) ( ) A 1 processes QR ( ) R 2 (0) ( ) R (0) 2 QR ( R (0) 3 ) ( R (1) 2 ) A 2 QR ( ) R 3 (0) ( ) A 3 4me AGULLO - LANGOU QR Factorization Over Runtime Systems 37
44 1D tile QR - binary tree When only R is wanted: The MPI_Allreduce In the case where only R is wanted, instead of construc4ng our own tree, one can simply use MPI_Allreduce with a user defined opera4on. The opera4on we give to MPI is basically the Algorithm 2. It performs the opera4on: This binary opera4on is associa3ve and this is all MPI needs to use a user defined opera4on on a user defined datatype. Moreover, if we change the signs of the elements of R so that the diagonal of R holds posi4ve elements then the binary opera4on Rfactor becomes commuta3ve. The code becomes two lines: QR ( R 1 ) R R 2 lapack_dgeqrf( mloc, n, A, lda, tau, &dlwork, lwork, &info ); MPI_Allreduce( MPI_IN_PLACE, A, 1, MPI_UPPER, LILA_MPIOP_QR_UPPER, mpi_comm); AGULLO - LANGOU QR Factorization Over Runtime Systems 38
45 1D tile QR - binary tree Blue Gene L frost.ncar.edu Q and R: Strong scalability In this experiment, we fix the problem: m=1,000,000 and n=50. Then we increase the number of processors MFLOPs/sec/proc # of processors ReduceHH (QR3) ReduceHH (QRF) ScaLAPACK QRF ScaLAPACK QR2 AGULLO - LANGOU QR Factorization Over Runtime Systems 39
46 1D tile QR - binary tree Grid 5000 AGULLO - LANGOU QR Factorization Over Runtime Systems 40
47 1D tile QR - binary tree Grid 5000 Latency (ms) Orsay Toulouse Bordeaux Sophia Orsay Toulouse Bordeaux Sophia 0.06 Throughput (Mb/s) Orsay Toulouse Bordeaux Sophia Orsay Toulouse Bordeaux Sophia 890 AGULLO - LANGOU QR Factorization Over Runtime Systems 41
48 1D tile QR - binary tree Cluster 1 Illustration of ScaLAPACK PDEGQR2 without reduce affinity Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 AGULLO - LANGOU QR Factorization Over Runtime Systems 42
49 1D tile QR - binary tree Cluster 1 Illustration of ScaLAPACK PDEGQR2 with reduce affinity Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 AGULLO - LANGOU QR Factorization Over Runtime Systems 43
50 1D tile QR - binary tree Cluster 1 Illustration of TSQR with reduce affinity Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 AGULLO - LANGOU QR Factorization Over Runtime Systems 44
51 1D tile QR - binary tree Classical Gram Schmidt Performance 10 3 for pg = 1, 2 or 4 clusters, pc = 32 nodes per cluster, m varies (x axis), n = 32 experiments model TSQR binary tree Performance 10 3 for pg = 1, 2 or 4 clusters, pc = 32 nodes per cluster, m varies (x axis), n = 32 experiments 4 clusters, GFlops/sec model 2 clusters, GFlops/sec 10 2 m = 16.e cluster, GFlops/sec GFlop/sec 4 clusters, GFlops/sec 2 clusters, GFlops/sec m = 8.e+06 GFlop/sec m = 5.e+5 m = 2.e m 1 cluster, 9.27 GFlops/sec m AGULLO - LANGOU QR Factorization Over Runtime Systems 45
52 1D tile QR - binary tree Consider architecture: parallel case: P processing units problem: QR factorization of a m-by-n TS matrix (TS = m/p n) (main) assumption: one processor has at least mn/p part of the matrix and one processor performs at least 2mn2 P flops 1D tile QR algorithm with a binary tree theory: TSQR ScaLAPACK-like Lower ( ) bound 2mn # flops 2 P + 2n3 3 log P 2mn 2 P 2n3 mn 3P Θ 2 P n # words 2 2 log P n 2 2 log P n 2 2 log P # messages log P 2n log P log P AGULLO - LANGOU QR Factorization Over Runtime Systems 46
53 Communication for tile algorithms tall and skinny matrix, parallel distributed case 1D tile QR algorithm with a flat tree tall and skinny matrix, sequential case general matrix, sequential case general matrix, parallel distributed case AGULLO - LANGOU QR Factorization Over Runtime Systems 47
54 1D tile QR - flat tree cpu (γ) cache (M fast ) bus (α, β) main memory (M slow ) AGULLO - LANGOU QR Factorization Over Runtime Systems 48
55 1D tile QR - flat tree Main memory Cache.me AGULLO - LANGOU QR Factorization Over Runtime Systems 49
56 1D tile QR - flat tree Main memory Cache.me AGULLO - LANGOU QR Factorization Over Runtime Systems 50
57 1D tile QR - flat tree Main memory Cache.me AGULLO - LANGOU QR Factorization Over Runtime Systems 51
58 1D tile QR - flat tree Main memory Cache.me AGULLO - LANGOU QR Factorization Over Runtime Systems 52
59 1D tile QR - flat tree Main memory Cache.me AGULLO - LANGOU QR Factorization Over Runtime Systems 53
60 1D tile QR - flat tree Main memory Cache.me AGULLO - LANGOU QR Factorization Over Runtime Systems 54
61 1D tile QR - flat tree Main memory.me AGULLO - LANGOU QR Factorization Over Runtime Systems 55
62 1D tile QR - flat tree Consider architecture: sequential case: one processing unit with cache of size (W) problem: QR factorization of a m-by-n TS matrix (TS = m n and W 3 2 n2 ) 1D tile QR algorithm with a flat tree theory: flat tree LAPACK-like Lower bound # flops 2mn 2 2mn 2 Θ(mn 2 ) # words mn m 2 n 2 2W mn mn mn # messages mn W 2W W AGULLO - LANGOU QR Factorization Over Runtime Systems 56
63 Communication for tile algorithms tall and skinny matrix, parallel distributed case 1D tile QR algorithm with a flat tree tall and skinny matrix, sequential case 1D tile QR algorithm with a flat tree general matrix, sequential case general matrix, parallel distributed case AGULLO - LANGOU QR Factorization Over Runtime Systems 57
64 2D tile QR QR factorization Binary tree Flat tree AGULLO - LANGOU QR Factorization Over Runtime Systems 58
65 2D tile QR QR factorization Binary tree Flat tree AGULLO - LANGOU QR Factorization Over Runtime Systems 59
66 2D tile QR - flat tree for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 60
67 2D tile QR - flat tree 1. dgeqrt(a[0][0], T[0][0]); V R T A for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 61
68 2D tile QR - flat tree 1. dgeqrt(a[0][0], T[0][0]); V R T A for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 62
69 2D tile QR - flat tree 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); R V T A for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 63
70 2D tile QR - flat tree 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); R V T A for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 64
71 2D tile QR - flat tree 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); R V T A for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 65
72 2D tile QR - flat tree 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); R V T A for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 66
73 2D tile QR - flat tree 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); R T R V A for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 67
74 2D tile QR - flat tree 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); R T R V A for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 68
75 2D tile QR - flat tree 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A B V B for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 69
76 2D tile QR - flat tree 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A B V B for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 70
77 2D tile QR - flat tree 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); 7. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); 8. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A B V B for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } AGULLO - LANGOU QR Factorization Over Runtime Systems 71
78 Lower bounds QR factorization 1. Extends previous lower bounds on the volume of communication for matrix-matrix multiplication from Hong and Kung (81) - sequential case: Ω( n 3 M ) Irony,Toledo,Tiskin (04) - parallel case: Ω( n 2 P ) 2. For LU, observe that: I 0 B I I 0 B A I 0 = A I I A B 0 0 I 0 0 I I therefore lower bound for matrix-matrix multiply (latency, bandwidth and operations) also holds for LU. AGULLO - LANGOU QR Factorization Over Runtime Systems 72
79 Lower bounds QR factorization 1. Extends previous lower bounds on the volume of communication for matrix-matrix multiplication from Hong and Kung (81) - sequential case: Ω( n 3 M ) Irony,Toledo,Tiskin (04) - parallel case: Ω( n 2 P ) 2. For LU, observe that: I 0 B I I 0 B A I 0 = A I I A B 0 0 I 0 0 I I therefore lower bound for matrix-matrix multiply (latency, bandwidth and operations) also holds for LU. AGULLO - LANGOU QR Factorization Over Runtime Systems 72
80 2D tile QR - binary tree # flops # words # messages 2D tile QR ScaLAPACK Algorithm binary tree PDGEQRF Lower bound ( 4 3 ) (n 3 /P ) ( ) ( 3 4 log P n 2 / ) P ( ( 3 P ) P) 8 log3 ( 4 3 ) (n 3 /P ) O ( n 3 /P ) ( ) ( 3 4 log P n 2 / ) P ( O n 2 / ) P ( ( 5 P ) P) 4 log2 (n) O 2D tile QR binary tree ScaLAPACK Performance models of parallel CAQR and ScaLAPACK s parallel QR factorization PDGEQRF on a square n n matrix with P processors, along with lower bounds on the number of flops, words, and messages. The matrix is stored in a 2-D P r P c block cyclic layout with square b b blocks. We choose b, P r, and P c optimally and independently for each algorithm. Everything (messages, words, and flops) is counted along the critical path. AGULLO - LANGOU QR Factorization Over Runtime Systems 73
81 2D tile QR - flat tree # flops # words 3 # messages 12 2D tile QR LAPACK Algorithm flat tree DGEQRF Lower bound ( 4 3 ) (n 3) n 3 W n 3 W 3/2 ( 4 3 ) (n 3) O ( n 3) ( ) 1 3 n4 O n 3 W W ( ) 1 2 n3 O n 3 W W 3/2 2D tile QR flat tree LAPACK Performance models of sequential CAQR and blocked sequential Householder QR on a square n n matrix with fast memory size W, along with lower bounds on the number of flops, words, and messages. AGULLO - LANGOU QR Factorization Over Runtime Systems 74
82 2D tile QR - flat tree cpu (γ) cache (M fast ) M fast = 1 (MB) M slow = 1 (GB) α = 0 (SEC) β = 10 8 (GB/SEC) γ = (GFLOPS/SEC) bus (α, β) main memory (M slow ) AGULLO - LANGOU QR Factorization Over Runtime Systems 75
83 2D tile QR - flat tree M fast = 10 6 M slow = 10 9 β = 10 8 γ = sequential case LU (or QR) factorization of square matrices percent of CPU peak performance /2 n = M fast problem lower bound upper bound for CAQR flat tree lower bound for the LAPACK algorithm 1/2 n = M slow n (matrix size) AGULLO - LANGOU QR Factorization Over Runtime Systems 76
84 2D tile QR - binary tree AGULLO - LANGOU QR Factorization Over Runtime Systems 77
85 Communication for tile algorithms Tile algorithms enable to reduce communication in the sequential case, and in the parallel case wrt existing software We can derive communication lower bounds for our problems We can attain (polylogarithmically) (assymptotically) communication lower bounds with tile algorithms Any tree is possible for the panel factorization We (Julien) used tile algorithms to minimize communication, We (Emmanuel) now explain them in the context of maximizing parallelism AGULLO - LANGOU QR Factorization Over Runtime Systems 78
86 Communication for tile algorithms Tile algorithms enable to reduce communication in the sequential case, and in the parallel case wrt existing software We can derive communication lower bounds for our problems We can attain (polylogarithmically) (assymptotically) communication lower bounds with tile algorithms Any tree is possible for the panel factorization We (Julien) used tile algorithms to minimize communication, We (Emmanuel) now explain them in the context of maximizing parallelism AGULLO - LANGOU QR Factorization Over Runtime Systems 78
87 Communication for tile algorithms Tile algorithms enable to reduce communication in the sequential case, and in the parallel case wrt existing software We can derive communication lower bounds for our problems We can attain (polylogarithmically) (assymptotically) communication lower bounds with tile algorithms Any tree is possible for the panel factorization We (Julien) used tile algorithms to minimize communication, We (Emmanuel) now explain them in the context of maximizing parallelism AGULLO - LANGOU QR Factorization Over Runtime Systems 78
88 Communication for tile algorithms Tile algorithms enable to reduce communication in the sequential case, and in the parallel case wrt existing software We can derive communication lower bounds for our problems We can attain (polylogarithmically) (assymptotically) communication lower bounds with tile algorithms Any tree is possible for the panel factorization We (Julien) used tile algorithms to minimize communication, We (Emmanuel) now explain them in the context of maximizing parallelism AGULLO - LANGOU QR Factorization Over Runtime Systems 78
89 Communication for tile algorithms Tile algorithms enable to reduce communication in the sequential case, and in the parallel case wrt existing software We can derive communication lower bounds for our problems We can attain (polylogarithmically) (assymptotically) communication lower bounds with tile algorithms Any tree is possible for the panel factorization We (Julien) used tile algorithms to minimize communication, We (Emmanuel) now explain them in the context of maximizing parallelism AGULLO - LANGOU QR Factorization Over Runtime Systems 78
90 Communication for tile algorithms Tile algorithms enable to reduce communication in the sequential case, and in the parallel case wrt existing software We can derive communication lower bounds for our problems We can attain (polylogarithmically) (assymptotically) communication lower bounds with tile algorithms Any tree is possible for the panel factorization We (Julien) used tile algorithms to minimize communication, We (Emmanuel) now explain them in the context of maximizing parallelism AGULLO - LANGOU QR Factorization Over Runtime Systems 78
91 Algorithms and libraries Software/Algorithms follow hardware evolution in time 70 s - LINPACK, vector operations: Level-1 BLAS operation 80 s - LAPACK, block, cache-friendly: Level-3 BLAS operation 90 s - ScaLAPACK, distributed memory: PBLAS Message passing Video: LAPACK 1 Thread AGULLO - LANGOU QR Factorization Over Runtime Systems 79
92 Algorithms and libraries Software/Algorithms follow hardware evolution in time 70 s - LINPACK, vector operations: Level-1 BLAS operation 80 s - LAPACK, block, cache-friendly: Level-3 BLAS operation 90 s - ScaLAPACK, distributed memory: PBLAS Message passing Video: LAPACK 1 Thread AGULLO - LANGOU QR Factorization Over Runtime Systems 79
93 Algorithms and libraries Software/Algorithms follow hardware evolution in time 70 s - LINPACK, vector operations: Level-1 BLAS operation 80 s - LAPACK, block, cache-friendly: Level-3 BLAS operation 90 s - ScaLAPACK, distributed memory: PBLAS Message passing Video: LAPACK 1 Thread AGULLO - LANGOU QR Factorization Over Runtime Systems 79
94 Algorithms and libraries Software/Algorithms follow hardware evolution in time 70 s - LINPACK, vector operations: Level-1 BLAS operation 80 s - LAPACK, block, cache-friendly: Level-3 BLAS operation 90 s - ScaLAPACK, distributed memory: PBLAS Message passing Video: LAPACK 1 Thread AGULLO - LANGOU QR Factorization Over Runtime Systems 79
95 LAPACK QR factorization Algorithms and libraries LAPACK QR factorization Block algorithm (Movie L1) Fork-join parallelism Multithreaded BLAS (Movie L4); AGULLO - LANGOU QR Factorization Over Runtime Systems 80
96 LAPACK QR factorization Algorithms and libraries LAPACK QR factorization Block algorithm (Movie L1) Fork-join parallelism Multithreaded BLAS (Movie L4); AGULLO - LANGOU QR Factorization Over Runtime Systems 80
97 LAPACK QR factorization Algorithms and libraries LAPACK QR factorization Block algorithm (Movie L1) Fork-join parallelism Multithreaded BLAS (Movie L4); AGULLO - LANGOU QR Factorization Over Runtime Systems 80
98 LAPACK QR factorization Algorithms and libraries LAPACK QR factorization Block algorithm (Movie L1) Fork-join parallelism Multithreaded BLAS (Movie L4); AGULLO - LANGOU QR Factorization Over Runtime Systems 80
99 LAPACK QR factorization Algorithms and libraries LAPACK QR factorization Block algorithm (Movie L1) Fork-join parallelism Multithreaded BLAS (Movie L4); AGULLO - LANGOU QR Factorization Over Runtime Systems 80
100 LAPACK QR factorization Algorithms and libraries LAPACK QR factorization Block algorithm (Movie L1) Fork-join parallelism Multithreaded BLAS (Movie L4); 140 LAPACK 3.2 Gflop/s Matrix size (N) Intel Xeon E7340 quad-socket quad-core (16 cores total) AGULLO - LANGOU QR Factorization Over Runtime Systems 80
101 LAPACK QR factorization Algorithms and libraries LAPACK QR factorization Block algorithm (Movie L1) Fork-join parallelism Multithreaded BLAS (Movie L4); Need for tile algorithms! AGULLO - LANGOU QR Factorization Over Runtime Systems 80
102 Three-layers paradigm Tile QR High-level algorithm 2D Tile QR - flat tree Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 81
103 DAG - 2D tile QR - flat tree Tile QR for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
104 DAG - 2D tile QR - flat tree Tile QR 1. dgeqrt(a[0][0], T[0][0]); V R T A dgeqrt 0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
105 DAG - 2D tile QR - flat tree Tile QR 1. dgeqrt(a[0][0], T[0][0]); V R T A dgeqrt 0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
106 DAG - 2D tile QR - flat tree Tile QR 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); R T V A dgeqrt 0 dlarb 0,1 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
107 DAG - 2D tile QR - flat tree Tile QR 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); R T V A dgeqrt 0 dlarb 0,1 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
108 DAG - 2D tile QR - flat tree Tile QR 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); R T V A dgeqrt 0 dlarb 0,1 dlarb 0,2 dlarb 0,3 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
109 DAG - 2D tile QR - flat tree Tile QR 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); R T V A dgeqrt 0 dlarb 0,1 dlarb 0,2 dlarb 0,3 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
110 DAG - 2D tile QR - flat tree Tile QR 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); R T R dgeqrt 0 V A dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
111 DAG - 2D tile QR - flat tree Tile QR 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); R T R dgeqrt 0 V A dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
112 DAG - 2D tile QR - flat tree Tile QR 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A dgeqrt 0 B V B dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); dssrb 1,1 for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
113 DAG - 2D tile QR - flat tree Tile QR 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A dgeqrt 0 B V B dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); dssrb 1,1 for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
114 DAG - 2D tile QR - flat tree Tile QR 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); 7. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); 8. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A dgeqrt 0 B V B dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dssrb 1,1 dssrb 1,1 dssrb 1,1 dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
115 DAG - 2D tile QR - flat tree Tile QR 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); 7. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); 8. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A dgeqrt 0 B V B dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dssrb 1,1 dssrb 1,1 dssrb 1,1 dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
116 DAG - 2D tile QR - flat tree Tile QR 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); 7. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); 8. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); dgeqrt 0 dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dssrb 1,1 dssrb 1,1 dssrb 1,1 dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } AGULLO - LANGOU QR Factorization Over Runtime Systems 82
117 DGEQRT DLARFB DLARFB DTSQRT DLARFB DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DLARFB DLARFB DTSQRT DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB QR factorization Tile QR 2D tile QR - flat tree - summary FOR k = 0..TILES-1 A[k][k], T[k][k] DGRQRT(A[k][k]) FOR m = k+1..tiles-1 A[k][k], A[m][k], T[m][k] DTSQRT(A[k][k], A[m][k], T[m][k]) FOR n = k+1..tiles-1 A[k][n] DLARFB(A[k][k], T[k][k], A[k][n]) FOR m = k+1..tiles-1 A[k][n], A[m][n] DSSRFB(A[m][k], T[m][k], A[k][n], A[m][n]) DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DGEQRT DLARFB DLARFB T R V1 T V1 C1 DTSQRT DSSRFB DSSRFB R C1 T V2 DTSQRT T V2 C2 DSSRFB DSSRFB Fine granularity; Tile layout; Different numerical properties; DAG to schedule. AGULLO - LANGOU QR Factorization Over Runtime Systems 83
118 DGEQRT DLARFB DLARFB DTSQRT DLARFB DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DLARFB DLARFB DTSQRT DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB QR factorization Tile QR 2D tile QR - flat tree - summary FOR k = 0..TILES-1 A[k][k], T[k][k] DGRQRT(A[k][k]) FOR m = k+1..tiles-1 A[k][k], A[m][k], T[m][k] DTSQRT(A[k][k], A[m][k], T[m][k]) FOR n = k+1..tiles-1 A[k][n] DLARFB(A[k][k], T[k][k], A[k][n]) FOR m = k+1..tiles-1 A[k][n], A[m][n] DSSRFB(A[m][k], T[m][k], A[k][n], A[m][n]) DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DGEQRT DLARFB DLARFB T R V1 T V1 C1 DTSQRT DSSRFB DSSRFB R C1 T V2 DTSQRT T V2 C2 DSSRFB DSSRFB Fine granularity; Tile layout; Different numerical properties; DAG to schedule. AGULLO - LANGOU QR Factorization Over Runtime Systems 83
119 Runtime System for multicore architectures Outline 1. QR factorization 2. Runtime System for multicore architectures 3. Using Accelerators 4. Enhancing parallelism 5. Distributed Memory AGULLO - LANGOU QR Factorization Over Runtime Systems 84
120 Runtime System for multicore architectures Intrusive (static) scheduler : Plasma 2.0 Three-layers paradigm High-level algorithm 2D Tile QR - flat tree Runtime System Intrusive (static) scheduler Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 85
121 Runtime System for multicore architectures Intrusive (static) scheduler : Plasma 2.0 Implementation with a static pipeline (Plasma 2.0) void dgeqrt(double *RV1, double *T); void dtsqrt(double *R, double *V2, double *T); void dlarfb(double *V1, double *T, double *C1); void dssrfb(double *V2, double *T, double *C1, double *C2); k = 0; n = my_core_id; while (n >= TILES) { k++; n = n-tiles+k; } m = k; while (k < TILES && n < TILES) { next_n = n; next_m = m; next_k = k; } next_m++; if (next_m == TILES) { next_n += cores_num; while (next_n >= TILES && next_k < TILES) { next_k++; next_n = next_n-tiles+next_k; } next_m = next_k; } if (n == k) { if (m == k) { while(progress[k][k]!= k-1); dgeqrt(a[k][k], T[k][k]); progress[k][k] = k; } else{ while(progress[m][k]!= k-1); dtsqrt(a[k][k], A[m][k], T[m][k]); progress[m][k] = k; } } else { if (m == k) { while(progress[k][k]!= k); while(progress[k][n]!= k-1); dlarfb(a[k][k], T[k][k], A[k][n]); } else { while(progress[m][k]!= k); while(progress[m][n]!= k-1); dssrfb(a[m][k], T[m][k], A[k][n], A[m][n]); progress[m][n] = k; } } n = next_n; m = next_m; k = next_k; Work partitioned in one dimension (by block-columns). Cyclic assignment of work across all steps of the factorization (pipelining of factorization steps). Process tracking by a global progress table. Stall on dependencies (busy waiting) DGEQRT DTSQRT DLARFB DSSRFB 6 AGULLO - LANGOU QR Factorization Over Runtime Systems 86
122 Runtime System for multicore architectures Intrusive (static) scheduler : Plasma 2.0 Movies Plasma 1 core Plasma 4 cores AGULLO - LANGOU QR Factorization Over Runtime Systems 87
123 Runtime System for multicore architectures Intel Xeon - 16 cores machine Performance Node: quad-socket quad-core Intel64 processors (16 cores). Intel Xeon processor: quad-core; frequency: 2,4 GHz. Theoretical peak: 9.6 Gflop/s/core; Gflop/s/node. System and compilers: Linux ; Intel Compilers AGULLO - LANGOU QR Factorization Over Runtime Systems 88
124 Runtime System for multicore architectures Intel64-16 cores - QR Performance PLASMA 2.0 MKL 10.1 LAPACK Gflop/s Matrix size (N) AGULLO - LANGOU QR Factorization Over Runtime Systems 89
125 Runtime System for multicore architectures Performance IBM Power6-32 cores machine Node: 16 dual-core Power6 processors (32 cores). Power6 processor: dual-core; each core 2-way SMT; L1: 64kB data + 64 kb instructions; L2: 4 MB per core, accessible by the other core; L3: 32 MB per processor, one controller per core (80 MB/s); frequency: 4,7 GHz. Theoretical peak: 18.8 Gflop/s/core; Gflop/s/node. System and compilers: AIX 5.3; xlf version 12.1; xlc version AGULLO - LANGOU QR Factorization Over Runtime Systems 90
126 Runtime System for multicore architectures Power6-32 cores - QR Performance PLASMA 2.0 ESSL 4.3 LAPACK Gflop/s Matrix size (N) AGULLO - LANGOU QR Factorization Over Runtime Systems 91
127 Runtime System for multicore architectures Three-layers paradigm Runtime System High-level algorithm Runtime System DAG vs Fork-Join Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 92
128 Runtime System for multicore architectures A few runtime systems DAG vs Fork-Join (Kurzak et al. 09) Cilk/Cilk++ [Fork-join]; SMP Superscalar (SMPSs) [DAG] GPUSs StarSs; StarPU; Quark (Plasma, since version 2.1); DAGuE (dplasma); SuperMatrix; Intel Threading Building Blocks; Charm++;... AGULLO - LANGOU QR Factorization Over Runtime Systems 93
129 Runtime System for multicore architectures A few runtime systems DAG vs Fork-Join (Kurzak et al. 09) Cilk/Cilk++ [Fork-join]; SMP Superscalar (SMPSs) [DAG] GPUSs StarSs; StarPU; Quark (Plasma, since version 2.1); DAGuE (dplasma); SuperMatrix; Intel Threading Building Blocks; Charm++;... AGULLO - LANGOU QR Factorization Over Runtime Systems 93
130 DGEQRT DLARFB DLARFB DTSQRT DLARFB DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DLARFB DLARFB DTSQRT DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB Runtime System for multicore architectures Fork-Join model - Cilk - 2D DAG vs Fork-Join (Kurzak et al. 09) DGEQRT DLARFB DLARFB T R V1 T V1 C1 DTSQRT R DSSRFB C1 DSSRFB T V2 T V2 C2 DTSQRT DSSRFB DSSRFB AGULLO - LANGOU QR Factorization Over Runtime Systems 94
131 DGEQRT DLARFB DLARFB DTSQRT DLARFB DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DLARFB DLARFB DTSQRT DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB Runtime System for multicore architectures Fork-Join model - Cilk - 1D DAG vs Fork-Join (Kurzak et al. 09) DGEQRT DLARFB DLARFB T R V1 T V1 C1 DTSQRT R DSSRFB C1 DSSRFB T V2 T V2 C2 DTSQRT DSSRFB DSSRFB AGULLO - LANGOU QR Factorization Over Runtime Systems 95
132 DGEQRT DLARFB DLARFB DTSQRT DLARFB DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DLARFB DLARFB DTSQRT DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DAG - SMPSs Runtime System for multicore architectures DAG vs Fork-Join (Kurzak et al. 09) DGEQRT DLARFB DLARFB T R V1 T V1 C1 DTSQRT R DSSRFB C1 DSSRFB T V2 T V2 C2 DTSQRT DSSRFB DSSRFB AGULLO - LANGOU QR Factorization Over Runtime Systems 96
133 Runtime System for multicore architectures Traces [Kurzak et al. 09] DAG vs Fork-Join (Kurzak et al. 09) AGULLO - LANGOU QR Factorization Over Runtime Systems 97
134 Runtime System for multicore architectures DAG vs Fork-Join (Kurzak et al. 09) Performance - Intel 16 cores [Kurzak et al. 09] AGULLO - LANGOU QR Factorization Over Runtime Systems 98
135 Outline Using Accelerators 1. QR factorization 2. Runtime System for multicore architectures 3. Using Accelerators 4. Enhancing parallelism 5. Distributed Memory AGULLO - LANGOU QR Factorization Over Runtime Systems 99
136 Using Accelerators Three-layers paradigm High-level algorithm Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 100
137 Using Accelerators Three-layers paradigm High-level algorithm Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 100
138 Using Accelerators Three-layers paradigm High-level algorithm Runtime System Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 100
139 GPU kernels Using Accelerators GPU kernels Kernels for update (ormqr and tsmqr) Fully on GPU Kernels for panel factorization (geqrt and tsqrt) Hybrid implementation CPU + GPU CUDA tsmqr kernel: ( ) [ ( ) ( ) ] A j T ( ) ki I I A j = I T j ki, A mi Vj Vj A mi 1. D 1 work = Aj ki + VT j Ami; 2. D 2 work = TjD1 work ; Aj ki = Aj ki D2 work ; 3. A mi = A mi V jd 2 work. AGULLO - LANGOU QR Factorization Over Runtime Systems 101
140 GPU kernels Using Accelerators GPU kernels Kernels for update (ormqr and tsmqr) Fully on GPU Kernels for panel factorization (geqrt and tsqrt) Hybrid implementation CPU + GPU CUDA tsmqr kernel: ( ) [ ( ) ( ) ] A j T ( ) ki I I A j = I T j ki, A mi Vj Vj A mi 1. D 1 work = Aj ki + VT j Ami; 2. D 2 work = TjD1 work ; Aj ki = Aj ki D2 work ; 3. A mi = A mi V jd 2 work. AGULLO - LANGOU QR Factorization Over Runtime Systems 101
141 GPU kernels Using Accelerators GPU kernels Kernels for update (ormqr and tsmqr) Fully on GPU Kernels for panel factorization (geqrt and tsqrt) Hybrid implementation CPU + GPU CUDA tsmqr kernel: ( ) [ ( ) ( ) ] A j T ( ) ki I I A j = I T j ki, A mi Vj Vj A mi 1. D 1 work = Aj ki + VT j Ami; 2. D 2 work = TjD1 work ; Aj ki = Aj ki D2 work ; 3. A mi = A mi V jd 2 work. AGULLO - LANGOU QR Factorization Over Runtime Systems 101
142 GPU kernels Using Accelerators GPU kernels Kernels for update (ormqr and tsmqr) Fully on GPU Kernels for panel factorization (geqrt and tsqrt) Hybrid implementation CPU + GPU CUDA tsmqr kernel: ( ) [ ( ) ( ) ] A j T ( ) ki I I A j = I T j ki, A mi Vj Vj A mi 1. D 1 work = Aj ki + VT j Ami; 2. D 2 work = TjD1 work ; Aj ki = Aj ki D2 work ; 3. A mi = A mi V jd 2 work. AGULLO - LANGOU QR Factorization Over Runtime Systems 101
143 GPU kernels Using Accelerators GPU kernels Kernels for update (ormqr and tsmqr) Fully on GPU Kernels for panel factorization (geqrt and tsqrt) Hybrid implementation CPU + GPU CUDA tsmqr kernel: ( ) [ ( ) ( ) ] A j T ( ) ki I I A j = I T j ki, A mi Vj Vj A mi 1. D 1 work = Aj ki + VT j Ami; 2. D 2 work = TjD1 work ; Aj ki = Aj ki D2 work ; 3. A mi = A mi V jd 2 work. AGULLO - LANGOU QR Factorization Over Runtime Systems 101
144 GPU kernels Using Accelerators GPU kernels Kernels for update (ormqr and tsmqr) Fully on GPU Kernels for panel factorization (geqrt and tsqrt) Hybrid implementation CPU + GPU CUDA tsmqr kernel: ( ) [ ( ) ( ) ] A j T ( ) ki I I A j = I T j ki, A mi Vj Vj A mi 1. D 1 work = Aj ki + VT j Ami; 2. D 2 work = TjD1 work ; Aj ki = Aj ki D2 work ; 3. A mi = A mi V jd 2 work. AGULLO - LANGOU QR Factorization Over Runtime Systems 101
145 GPU kernels Using Accelerators GPU kernels Kernels for update (ormqr and tsmqr) Fully on GPU Kernels for panel factorization (geqrt and tsqrt) Hybrid implementation CPU + GPU CUDA tsmqr kernel: ( ) [ ( ) ( ) ] A j T ( ) ki I I A j = I T j ki, A mi Vj Vj A mi 1. D 1 work = Aj ki + VT j Ami; 2. D 2 work = TjD1 work ; Aj ki = Aj ki D2 work ; 3. A mi = A mi V jd 2 work. AGULLO - LANGOU QR Factorization Over Runtime Systems 101
146 Architecture Using Accelerators GPU kernels AMD Opteron 8358 SE CPU, 4 4, 2.4GHz, 4 8GB NVIDIA Tesla S1070 GPU, 4 240, 1.3GHz, 4 4GB single precision: peak: 3067Gflop/s ( ) sgemm: 1908Gflop/s ( ) double precision: peak: 498.6Gflop/s ( ) dgemm: 467.2Gflop/s ( ) AGULLO - LANGOU QR Factorization Over Runtime Systems 102
147 Tuning Using Accelerators Tuning Panel factorization (sgeqrt kernel) Update phase (stsmqr kernel) AGULLO - LANGOU QR Factorization Over Runtime Systems 103
148 Performance Using Accelerators Static scheduling Single precision - 4 GPUs Single precision - 3 GPUs Single precision - 2 GPUs Single precision - 1 GPUs Double precision - 4 GPUs Double precision - 3 GPUs Double precision - 2 GPUs Double precision - 1 GPUs Gflop/s Matrix order How to use the 16 CPU cores? AGULLO - LANGOU QR Factorization Over Runtime Systems 104
149 Performance Using Accelerators Static scheduling Single precision - 4 GPUs Single precision - 3 GPUs Single precision - 2 GPUs Single precision - 1 GPUs Double precision - 4 GPUs Double precision - 3 GPUs Double precision - 2 GPUs Double precision - 1 GPUs Gflop/s Matrix order How to use the 16 CPU cores? AGULLO - LANGOU QR Factorization Over Runtime Systems 104
150 Using Accelerators Three-layers paradigm StarPU High-level algorithm Runtime System StarPU Device kernels CPU core GPU AGULLO - LANGOU QR Factorization Over Runtime Systems 105
151 Using Accelerators StarPU GPU-enabled runtime systems Cilk/Cilk++; SMP Superscalar (SMPSs) GPUSs StarSs; StarPU; Quark (Plasma, since version 2.1); DAGuE (dplasma); SuperMatrix; Intel Threading Building Blocks; Charm++;... AGULLO - LANGOU QR Factorization Over Runtime Systems 106
152 Using Accelerators StarPU GPU-enabled runtime systems Cilk/Cilk++; SMP Superscalar (SMPSs) GPUSs StarSs; StarPU; Quark (Plasma, since version 2.1); DAGuE (dplasma); SuperMatrix; Intel Threading Building Blocks; Charm++;... AGULLO - LANGOU QR Factorization Over Runtime Systems 106
153 Using Accelerators StarPU GPU-enabled runtime systems Cilk/Cilk++; SMP Superscalar (SMPSs) GPUSs StarSs; StarPU; Quark (Plasma, since version 2.1); DAGuE (dplasma); SuperMatrix; Intel Threading Building Blocks; Charm++;... AGULLO - LANGOU QR Factorization Over Runtime Systems 106
154 Using Accelerators StarPU The StarPU runtime system [Augonnet et al.] Data management: Checks dependences; Ensures coherency; Supports: SMP/Multicore Processors (x86, PPC,... ); NVIDIA GPUs; OpenCL devices; Cell Processors (experimental). Scheduling module. AGULLO - LANGOU QR Factorization Over Runtime Systems 107
155 Using Accelerators StarPU 2D Tile QR - flat tree - over StarPU for (k = 0; k < min(mt, NT); k++){ starpu_insert_task(&cl_zgeqrt, k, k,...); for (n = k+1; n < NT; n++) starpu_insert_task(&cl_zunmqr, k, n,...); for (m = k+1; m < MT; m++){ starpu_insert_task(&cl_ztsqrt, m, k,...); } } for (n = k+1; n < NT; n++) starpu_insert_task(&cl_ztsmqr, m, n, k,...); See code AGULLO - LANGOU QR Factorization Over Runtime Systems 108
156 Using Accelerators StarPU 2D Tile QR - flat tree - over StarPU for (k = 0; k < min(mt, NT); k++){ starpu_insert_task(&cl_zgeqrt, k, k,...); for (n = k+1; n < NT; n++) starpu_insert_task(&cl_zunmqr, k, n,...); for (m = k+1; m < MT; m++){ starpu_insert_task(&cl_ztsqrt, m, k,...); } } for (n = k+1; n < NT; n++) starpu_insert_task(&cl_ztsmqr, m, n, k,...); See code AGULLO - LANGOU QR Factorization Over Runtime Systems 108
157 DGEQRT DLARFB DLARFB DTSQRT DLARFB DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DLARFB DLARFB DTSQRT DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DLARFB DGEQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB Using Accelerators StarPU Relieving anti-dependencies (SMPSs trick reminder) DGEQRT DLARFB DLARFB T R V1 T V1 C1 DTSQRT R DSSRFB C1 DSSRFB T V2 T V2 C2 DTSQRT DSSRFB DSSRFB AGULLO - LANGOU QR Factorization Over Runtime Systems 109
158 Using Accelerators StarPU Relieving anti-dependencies (using StarPU) GFlop/s Copy block RW access mode Matrix order AGULLO - LANGOU QR Factorization Over Runtime Systems 110
159 Using Accelerators StarPU Impact of the scheduling policy Gflop/s Matrix order GREEDY Name Policy description heft-tmdp-pr heft-tmdp with data PRefetch heft-tmdp heft-tm with remote Data Penalty (αt data transfert + T computation ) heft-tm-pr heft-tm with data PRefetch heft-tm HEFT based on Task duration Models (T data transfert + T computation ) greedy Greedy policy AGULLO - LANGOU QR Factorization Over Runtime Systems 111
160 Using Accelerators StarPU Impact of the scheduling policy Gflop/s Matrix order HEFT-TM GREEDY Name Policy description heft-tmdp-pr heft-tmdp with data PRefetch heft-tmdp heft-tm with remote Data Penalty (αt data transfert + T computation ) heft-tm-pr heft-tm with data PRefetch heft-tm HEFT based on Task duration Models (T data transfert + T computation ) greedy Greedy policy AGULLO - LANGOU QR Factorization Over Runtime Systems 111
161 Using Accelerators StarPU Impact of the scheduling policy Gflop/s Matrix order HEFT-TM-PR HEFT-TM GREEDY Name Policy description heft-tmdp-pr heft-tmdp with data PRefetch heft-tmdp heft-tm with remote Data Penalty (αt data transfert + T computation ) heft-tm-pr heft-tm with data PRefetch heft-tm HEFT based on Task duration Models (T data transfert + T computation ) greedy Greedy policy AGULLO - LANGOU QR Factorization Over Runtime Systems 111
162 Using Accelerators StarPU Impact of the scheduling policy Gflop/s Matrix order HEFT-TMDP HEFT-TM-PR HEFT-TM GREEDY Name Policy description heft-tmdp-pr heft-tmdp with data PRefetch heft-tmdp heft-tm with remote Data Penalty (αt data transfert + T computation ) heft-tm-pr heft-tm with data PRefetch heft-tm HEFT based on Task duration Models (T data transfert + T computation ) greedy Greedy policy AGULLO - LANGOU QR Factorization Over Runtime Systems 111
163 Using Accelerators StarPU Impact of the scheduling policy Gflop/s Matrix order HEFT-TMDP-PR HEFT-TMDP HEFT-TM-PR HEFT-TM GREEDY Name Policy description heft-tmdp-pr heft-tmdp with data PRefetch heft-tmdp heft-tm with remote Data Penalty (αt data transfert + T computation ) heft-tm-pr heft-tm with data PRefetch heft-tm HEFT based on Task duration Models (T data transfert + T computation ) greedy Greedy policy AGULLO - LANGOU QR Factorization Over Runtime Systems 111
164 Using Accelerators StarPU Impact of the scheduling policy Gflop/s Matrix order HEFT-TMDP-PR HEFT-TM-PR Name Policy description heft-tmdp-pr heft-tmdp with data PRefetch heft-tmdp heft-tm with remote Data Penalty (αt data transfert + T computation ) heft-tm-pr heft-tm with data PRefetch heft-tm HEFT based on Task duration Models (T data transfert + T computation ) greedy Greedy policy AGULLO - LANGOU QR Factorization Over Runtime Systems 111
165 Using Accelerators StarPU Impact of the Data Penalty on the total amount of data movement Gflop/s Matrix order HEFT-TMDP-PR HEFT-TM-PR Matrix order heft-tmdp-pr 1.9 GB 16.3 GB 25.4 GB 41.6 GB heft-tm-pr 3.8 GB 57.2 GB GB GB AGULLO - LANGOU QR Factorization Over Runtime Systems 112
166 Performance Using Accelerators StarPU GPUs + 4 CPUs - Single 3 GPUs + 3 CPUs - Single 2 GPUs + 2 CPUs - Single 1 GPUs + 1 CPUs - Single 4 GPUs + 4 CPUs - Double 3 GPUs + 3 CPUs - Double 2 GPUs + 2 CPUs - Double 1 GPUs + 1 CPUs - Double Gflop/s Matrix order + 200Gflop/s but 12 cores = 150Gflop/s AGULLO - LANGOU QR Factorization Over Runtime Systems 113
167 Performance Using Accelerators StarPU GPUs + 16 CPUs - Single 4 GPUs + 4 CPUs - Single 3 GPUs + 3 CPUs - Single 2 GPUs + 2 CPUs - Single 1 GPUs + 1 CPUs - Single 4 GPUs + 16 CPUs - Double 4 GPUs + 4 CPUs - Double 3 GPUs + 3 CPUs - Double 2 GPUs + 2 CPUs - Double 1 GPUs + 1 CPUs - Double Gflop/s Matrix order + 200Gflop/s but 12 cores = 150Gflop/s AGULLO - LANGOU QR Factorization Over Runtime Systems 113
168 Heterogeneity Using Accelerators StarPU Kernel CPU GPU Speedup sgeqrt 9 Gflops 60 Gflops 6 stsqrt 12 Gflops 67 Gflops 6 sormqr 8.5 Gflops 227 Gflops 27 stsmqr 10 Gflops 285 Gflops 27 Task distribution observed on StarPU: sgeqrt: 20% of tasks on GPUs stsmqr: 92.5% of tasks on GPUs Taking advantage of heterogeneity! Only do what you are good for Don t do what you are not good for AGULLO - LANGOU QR Factorization Over Runtime Systems 114
169 Outline Enhancing parallelism 1. QR factorization 2. Runtime System for multicore architectures 3. Using Accelerators 4. Enhancing parallelism 5. Distributed Memory AGULLO - LANGOU QR Factorization Over Runtime Systems 115
170 Enhancing parallelism DAG of a 4x4 tile matrix Enhancing parallelism 2D tile QR - flat tree 2D tile QR - hybrid binary/flat tree AGULLO - LANGOU QR Factorization Over Runtime Systems 116
171 Enhancing parallelism DAG of a 4x4 tile matrix Enhancing parallelism 2D tile QR - flat tree 2D tile QR - hybrid binary/flat tree AGULLO - LANGOU QR Factorization Over Runtime Systems 116
172 Enhancing parallelism Enhancing parallelism 2D tile QR - hybrid binary/flat tree First panel factorization and corresponding updates. AGULLO - LANGOU QR Factorization Over Runtime Systems 117
173 Enhancing parallelism Enhancing parallelism 2D tile QR - hybrid binary/flat tree Second panel factorization and corresponding updates. AGULLO - LANGOU QR Factorization Over Runtime Systems 117
174 Enhancing parallelism Enhancing parallelism 2D tile QR - hybrid binary/flat tree Final panel factorization. AGULLO - LANGOU QR Factorization Over Runtime Systems 117
175 Enhancing parallelism Enhancing parallelism 2D tile QR - hybrid binary/flat tree Final panel factorization. AGULLO - LANGOU QR Factorization Over Runtime Systems 117
176 Enhancing parallelism Enhancing parallelism 16x2 tile matrix - 1 domain (flat tree) DAG Traces (8 cores) AGULLO - LANGOU QR Factorization Over Runtime Systems 118
177 Enhancing parallelism Enhancing parallelism 16x2 tile matrix - 8 domains (hybrid tree) DAG Traces (8 cores) AGULLO - LANGOU QR Factorization Over Runtime Systems 119
178 Enhancing parallelism Enhancing parallelism 16x2 tile matrix - 16 domains (binary tree) DAG Traces (8 cores) AGULLO - LANGOU QR Factorization Over Runtime Systems 120
179 Enhancing parallelism Enhancing parallelism 32x4 tile matrix - 1 domain (flat tree) AGULLO - LANGOU QR Factorization Over Runtime Systems 121
180 Enhancing parallelism Enhancing parallelism 32x4 tile matrix - 16 domains (hybrid tree) AGULLO - LANGOU QR Factorization Over Runtime Systems 122
181 Enhancing parallelism Enhancing parallelism 32x4 tile matrix - 16 domains (hybrid tree) DAG Traces (8 cores) AGULLO - LANGOU QR Factorization Over Runtime Systems 122
182 N = cores Enhancing parallelism Performance on a multicore node AGULLO - LANGOU QR Factorization Over Runtime Systems 123
183 M = cores Enhancing parallelism Performance on a multicore node AGULLO - LANGOU QR Factorization Over Runtime Systems 124
184 Enhancing parallelism M = N 3200 Performance on a multicore node AGULLO - LANGOU QR Factorization Over Runtime Systems 125
185 Enhancing parallelism Performance on an heterogeneous node Heterogeneous platform - double precision - N = 2* Gflop/s domains (binary) domains 8 domains 4 domains 2 domains 1 domain (flat) Number of tile rows AGULLO - LANGOU QR Factorization Over Runtime Systems 126
186 Outline Distributed Memory 1. QR factorization 2. Runtime System for multicore architectures 3. Using Accelerators 4. Enhancing parallelism 5. Distributed Memory AGULLO - LANGOU QR Factorization Over Runtime Systems 127
187 Distributed Memory Scalability - Cholesky 6 nodes ; 3 GPUs and 12 cores per node 7000 Speed (in GFlop/s) MPI nodes 4 MPI nodes MPI nodes 1 MPI node Matrix order AGULLO - LANGOU QR Factorization Over Runtime Systems 128
188 Distributed Memory Scalability - Cholesky Total of 3 GPUs and 12 cores AGULLO - LANGOU QR Factorization Over Runtime Systems 129
189 Distributed Memory Scalability: towards exascale? AGULLO - LANGOU QR Factorization Over Runtime Systems 130
Hierarchical QR factorization algorithms for multi-core cluster systems
Hierarchical QR factorization algorithms for multi-core cluster systems Jack Dongarra Mathieu Faverge Thomas Herault Julien Langou Yves Robert University of Tennessee Knoxville, USA University of Colorado
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationDynamic Scheduling within MAGMA
Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing
More informationCommunication-avoiding LU and QR factorizations for multicore architectures
Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization
More informationTall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222
Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department
More informationMAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors
MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:
More informationTile QR Factorization with Parallel Panel Processing for Multicore Architectures
Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel Laura Grigori Mark Frederick Hoemmen Julien Langou Electrical Engineering and Computer Sciences University of California at
More informationAccelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More informationAccelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationModeling and Tuning Parallel Performance in Dense Linear Algebra
Modeling and Tuning Parallel Performance in Dense Linear Algebra Initial Experiences with the Tile QR Factorization on a Multi Core System CScADS Workshop on Automatic Tuning for Petascale Systems Snowbird,
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012
MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7
More informationJacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest
More informationLAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou Dpt of Electrical Engineering
More informationMinimizing Communication in Linear Algebra. James Demmel 15 June
Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on
More informationCS 594 Scientific Computing for Engineers Spring Dense Linear Algebra. Mark Gates
CS 594 Scientific Computing for Engineers Spring 2017 Dense Linear Algebra Mark Gates mgates3@icl.utk.edu http://www.icl.utk.edu/~mgates3/ 1 Outline Legacy Software BLAS LINPACK LAPACK ScaLAPACK New Software
More informationINITIAL INTEGRATION AND EVALUATION
INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationIntroduction to communication avoiding linear algebra algorithms in high performance computing
Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for
More informationRe-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009
III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009 Background and motivation Running time of an algorithm
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING
ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING JACK DONGARRA UNIVERSITY OF TENNESSEE OAK RIDGE NATIONAL LAB What Is LINPACK? LINPACK is a package of mathematical
More informationPower Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
Noname manuscript No. (will be inserted by the editor) Power Profiling of Cholesky and QR Factorizations on Distributed s George Bosilca Hatem Ltaief Jack Dongarra Received: date / Accepted: date Abstract
More informationEnhancing Performance of Tall-Skinny QR Factorization using FPGAs
Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Abid Rafique Imperial College London August 31, 212 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18 Our Claim Common
More informationMARCH 24-27, 2014 SAN JOSE, CA
MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationStatic-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems
Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse
More informationTable 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )
ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS Abid Rafique, Nachiket Kapre and George A. Constantinides Electrical and Electronic Engineering Department Imperial College London London,
More informationPorting a sphere optimization program from LAPACK to ScaLAPACK
Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference
More informationMUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux
The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent
More informationMulticore Parallelization of Determinant Quantum Monte Carlo Simulations
Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March
More informationAccelerating computation of eigenvectors in the nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationPerformance optimization of WEST and Qbox on Intel Knights Landing
Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University
More informationMatrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing)
Matrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing) alfredo.buttari@enseeiht.fr for an up-to-date version of the slides: http://buttari.perso.enseeiht.fr Introduction Objective
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More informationLattice Boltzmann simulations on heterogeneous CPU-GPU clusters
Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts
More informationAvoiding Communication in Distributed-Memory Tridiagonalization
Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura
More informationA distributed packed storage for large dense parallel in-core calculations
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A
More informationAccelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J. SCI. COMPUT. Vol. 37, No. 3, pp. C307 C330 c 2015 Society for Industrial and Applied Mathematics MIXED-PRECISION CHOLESKY QR FACTORIZATION AND ITS CASE STUDIES ON MULTICORE CPU WITH MULTIPLE GPUS
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,
More informationCS 594 Scientific Computing for Engineers Spring Dense Linear Algebra. Mark Gates
CS 594 Scientific Computing for Engineers Spring 2018 Dense Linear Algebra Mark Gates mgates3@icl.utk.edu http://www.icl.utk.edu/~mgates3/ 1 Outline Legacy Software BLAS LINPACK LAPACK ScaLAPACK New Software
More informationTOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM
TOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM AZZAM HAIDAR, HATEM LTAIEF, AND JACK DONGARRA Abstract. Classical solvers for the dense symmetric eigenvalue
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationPRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM
Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value
More informationA communication-avoiding thick-restart Lanczos method on a distributed-memory system
A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart
More informationA Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures
A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More information1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria
1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation
More informationA hybrid Hermitian general eigenvalue solver
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationPreconditioned Eigensolver LOBPCG in hypre and PETSc
Preconditioned Eigensolver LOBPCG in hypre and PETSc Ilya Lashuk, Merico Argentati, Evgueni Ovtchinnikov, and Andrew Knyazev Department of Mathematics, University of Colorado at Denver, P.O. Box 173364,
More informationMagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs
MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,
More informationSymmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano
Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou To cite this version: Emmanuel Agullo, Camille Coti,
More informationA Fully Empirical Autotuned Dense QR Factorization For Multicore Architectures
A Fully Empirical Autotuned Dense QR Factorization For Multicore Architectures Emmanuel Agullo, Jack Dongarra, Rajib Nath,Stanimire Tomov Theme : Distributed and High Performance Computing Équipes-Projets
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationarxiv: v1 [cs.ms] 18 Nov 2016
Bidiagonalization with Parallel Tiled Algorithms Mathieu Faverge,2, Julien Langou 3, Yves Robert 2,4, and Jack Dongarra 2,5 Bordeaux INP, CNRS, INRIA et Université de Bordeaux, France 2 University of Tennessee,
More informationLU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version
1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2
More informationA parallel tiled solver for dense symmetric indefinite systems on multicore architectures
A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris
More informationB629 project - StreamIt MPI Backend. Nilesh Mahajan
B629 project - StreamIt MPI Backend Nilesh Mahajan March 26, 2013 Abstract StreamIt is a language based on the dataflow model of computation. StreamIt consists of computation units called filters connected
More informationOutline. Recursive QR factorization. Hybrid Recursive QR factorization. The linear least squares problem. SMP algorithms and implementations
Outline Recursie QR factorization Hybrid Recursie QR factorization he linear least squares problem SMP algorithms and implementations Conclusions Recursion Automatic ariable blocking (e.g., QR factorization)
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationSome notes on efficient computing and setting up high performance computing environments
Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient
More informationLevel-3 BLAS on a GPU
Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems Marc Baboulin, Jack Dongarra, Rémi Lacroix To cite this version: Marc Baboulin, Jack Dongarra, Rémi Lacroix. Computing least squares
More informationParalleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures
Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures Mark Gove? NOAA Earth System Research Laboratory We Need Be?er Numerical Weather Predic(on Superstorm Sandy Hurricane
More informationSaving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors
20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationFPGA Implementation of a Predictive Controller
FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
More informationPorting a Sphere Optimization Program from lapack to scalapack
Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential
More informationIntel Math Kernel Library (Intel MKL) LAPACK
Intel Math Kernel Library (Intel MKL) LAPACK Linear equations Victor Kostin Intel MKL Dense Solvers team manager LAPACK http://www.netlib.org/lapack Systems of Linear Equations Linear Least Squares Eigenvalue
More informationChe-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University
Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationReproducible Tall-Skinny QR
Reproducible Tall-Skinny QR James Demmel and Hong Diep Nguyen University of California, Berkeley, Berkeley, USA demmel@cs.berkeley.edu, hdnguyen@eecs.berkeley.edu Abstract Reproducibility is the ability
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationThe Design, Implementation, and Evaluation of a Symmetric Banded Linear Solver for Distributed-Memory Parallel Computers
The Design, Implementation, and Evaluation of a Symmetric Banded Linear Solver for Distributed-Memory Parallel Computers ANSHUL GUPTA and FRED G. GUSTAVSON IBM T. J. Watson Research Center MAHESH JOSHI
More informationParallel Numerical Linear Algebra
Parallel Numerical Linear Algebra 1 QR Factorization solving overconstrained linear systems tiled QR factorization using the PLASMA software library 2 Conjugate Gradient and Krylov Subspace Methods linear
More informationImplementation of a preconditioned eigensolver using Hypre
Implementation of a preconditioned eigensolver using Hypre Andrew V. Knyazev 1, and Merico E. Argentati 1 1 Department of Mathematics, University of Colorado at Denver, USA SUMMARY This paper describes
More informationTheoretical Computer Science
Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized
More informationUsing Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods
Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Marc Baboulin 1, Xiaoye S. Li 2 and François-Henry Rouet 2 1 University of Paris-Sud, Inria Saclay, France 2 Lawrence Berkeley
More informationSaving Energy in Sparse and Dense Linear Algebra Computations
Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain
More informationDivide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures
Divide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures Grégoire Pichon, Azzam Haidar, Mathieu Faverge, Jakub Kurzak To cite this version: Grégoire Pichon, Azzam Haidar, Mathieu
More informationDesign of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization
Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan 2, Robert A. van de Geijn 2, and Field
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationClaude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique
Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)
More informationAn Integrative Model for Parallelism
An Integrative Model for Parallelism Victor Eijkhout ICERM workshop 2012/01/09 Introduction Formal part Examples Extension to other memory models Conclusion tw-12-exascale 2012/01/09 2 Introduction tw-12-exascale
More informationAvoiding Communication in Linear Algebra
Avoiding Communication in Linear Algebra Grey Ballard Microsoft Research Faculty Summit July 15, 2014 Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation,
More informationParallel Polynomial Evaluation
Parallel Polynomial Evaluation Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics, and Computer Science http://www.math.uic.edu/ jan jan@math.uic.edu
More informationSOLUTION of linear systems of equations of the form:
Proceedings of the Federated Conference on Computer Science and Information Systems pp. Mixed precision iterative refinement techniques for the WZ factorization Beata Bylina Jarosław Bylina Institute of
More information