Hierarchical QR factorization algorithms for multi-core cluster systems
1 Hierarchical QR factorization algorithms for multi-core cluster systems. Jack Dongarra, Mathieu Faverge, Thomas Herault, Julien Langou, Yves Robert. University of Tennessee Knoxville, USA; University of Colorado Denver, USA; Ecole Normale Supérieure de Lyon, France. June 29, 2012
2 reducing communication in sequential in parallel distributed increasing parallelism (or reducing the critical path, reducing synchronization) Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 2 of 103
3 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 3 of 103
4 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Reduce Algorithms: Introduction. The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications. Input: A is block distributed by rows. Output: Q is block distributed by rows, R is global. [Figure: blocks A_1, A_2, A_3 map to Q_1, Q_2, Q_3 and a shared R.] Julien Langou University of Colorado Denver Hierarchical QR 4 of 103
5 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Example of applications: in block iterative methods. a) in iterative methods with multiple right-hand sides (block iterative methods): 1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk), 2) BlockGMRES, BlockGCR, BlockCG, BlockQMR, ... b) in iterative methods with a single right-hand side: 1) s-step methods for linear systems of equations (e.g. A. Chronopoulos), 2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc, 3) recent work from M. Hoemmen and J. Demmel (U. California at Berkeley). c) in iterative eigenvalue solvers: 1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC), 2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX, 3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk), 4) PRIMME (A. Stathopoulos, Coll. William & Mary), 5) and also TRLAN, BLZPACK, IRBLEIGS. Julien Langou University of Colorado Denver Hierarchical QR 5 of 103
6 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Reduce Algorithms: Introduction. Example of applications: a) in linear least squares problems in which the number of equations is much larger than the number of unknowns, b) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers), c) in dense, large, and more square QR factorizations, where they are used as the panel factorization step. Julien Langou University of Colorado Denver Hierarchical QR 6 of 103
7 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Blocked LU and QR algorithms (LAPACK). LAPACK block LU (right looking): dgetrf. LAPACK block QR (right looking): dgeqrf. Panel factorization: dgetf2 (lu of the panel); dgeqr2 + dlarft (qr of the panel). Update of the remaining submatrix: dtrsm (+ dswp) and dgemm; dlarfb. [Figure: panel/update decomposition of A^(1) into L, U (resp. V, R), producing A^(2).] Julien Langou University of Colorado Denver Hierarchical QR 7 of 103
8 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Blocked LU and QR algorithms (LAPACK). LAPACK block LU (right looking): dgetrf. Panel factorization (dgetf2, lu of the panel): latency bounded, more than nb AllReduces for n*nb^2 ops. Update of the remaining submatrix (dtrsm (+ dswp), dgemm): CPU bandwidth bounded, the bulk of the computation (n*n*nb ops), highly parallelizable, efficient and scalable. Julien Langou University of Colorado Denver Hierarchical QR 8 of 103
9 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Parallelization of LU and QR. Parallelize the update: easy, and done in any reasonable software. This is the 2/3 n^3 term in the FLOPs count. Can be done efficiently with LAPACK + multithreaded BLAS. [Figure: dgemm update; dgetf2 lu( ); dtrsm (+ dswp), dgemm.] Julien Langou University of Colorado Denver Hierarchical QR 9 of 103
10 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Parallelization of LU and QR. Parallelize the update: easy, and done in any reasonable software. This is the 2/3 n^3 term in the FLOPs count. Can be done efficiently with LAPACK + multithreaded BLAS. Parallelize the panel factorization: not an option in a multicore context (p < 16); see e.g. ScaLAPACK or HPL, but the panel is still by far the slowest step and the bottleneck of the computation. Hide the panel factorization: lookahead (see e.g. High Performance LINPACK), dynamic scheduling. Julien Langou University of Colorado Denver Hierarchical QR 10 of 103
11 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Hiding the panel factorization with dynamic scheduling. [Execution trace over time.] Courtesy of Alfredo Buttari, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 11 of 103
12 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? Julien Langou University of Colorado Denver Hierarchical QR 12 of 103
13 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 13 of 103
14 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. We cannot hide the panel factorization in the MM; actually, it is the MMs that are hidden by the panel factorizations! Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 14 of 103
15 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. We cannot hide the panel factorization (n^2) with the MM (n^3); actually, it is the MMs that are hidden by the panel factorizations! NEED FOR NEW MATHEMATICAL ALGORITHMS. Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 15 of 103
16 Tall and Skinny matrices Minimizing communication in parallel distributed (29) A new generation of algorithms? Algorithms follow hardware evolution along time. LINPACK (80s) (vector operations): relies on Level 1 BLAS operations. LAPACK (90s) (blocking, cache friendly): relies on Level 3 BLAS operations. Julien Langou University of Colorado Denver Hierarchical QR 16 of 103
17 Tall and Skinny matrices Minimizing communication in parallel distributed (29) A new generation of algorithms? Algorithms follow hardware evolution along time. LINPACK (80s) (vector operations): relies on Level 1 BLAS operations. LAPACK (90s) (blocking, cache friendly): relies on Level 3 BLAS operations. New algorithms (00s) (multicore friendly): rely on a DAG/scheduler, a block data layout, and some extra kernels. Those new algorithms have a very low granularity and scale very well (multicore, petascale computing, ...), remove a lot of dependencies among the tasks (multicore, distributed computing), avoid latency (distributed computing, out of core), and rely on fast kernels. Those new algorithms need new kernels and rely on efficient scheduling algorithms. Julien Langou University of Colorado Denver Hierarchical QR 17 of 103
18 Tall and Skinny matrices Minimizing communication in parallel distributed (29) New algorithms based on 2D partitioning: UTexas (van de Geijn): SYRK, CHOL (multicore), LU, QR (out of core); UTennessee (Dongarra): CHOL (multicore); HPC2N (Kågström) / IBM (Gustavson): CHOL (distributed); UC Berkeley (Demmel) / INRIA (Grigori): LU/QR (distributed); UC Denver (Langou): LU/QR (distributed). A 3rd revolution for dense linear algebra? Julien Langou University of Colorado Denver Hierarchical QR 18 of 103
19 Tall and Skinny matrices Minimizing communication in parallel distributed (29) reducing communication in sequential in parallel distributed increasing parallelism (or reducing the critical path, reducing synchronization) We start with reducing communication in parallel distributed in the tall and skinny case. Julien Langou University of Colorado Denver Hierarchical QR 19 of 103
20 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. [Figure: processes versus time; A_0 is factored into Q_0 on process 0, A_1 into Q_1 on process 1.] Julien Langou University of Colorado Denver Hierarchical QR 20 of 103
21 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 1: each process factors its local block, QR(A_0) -> (V_0^(0), R_0^(0)) and QR(A_1) -> (V_1^(0), R_1^(0)). Julien Langou University of Colorado Denver Hierarchical QR 21 of 103
22 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 2: the two triangular factors are gathered and stacked into [R_0^(0); R_1^(0)]. Julien Langou University of Colorado Denver Hierarchical QR 22 of 103
23 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 3: the stack is factored, QR([R_0^(0); R_1^(0)]) -> ([V_0^(1); V_1^(1)], R^(1)), which produces the R factor of the whole matrix. Julien Langou University of Colorado Denver Hierarchical QR 23 of 103
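The combine step above (stack two triangular factors and take the QR of the stack) is the only new building block. Below is a minimal sketch of that step, assuming column-major storage and LAPACKE; the helper name tsqr_combine is hypothetical, and a production TSQR kernel would exploit the triangular structure of the stack instead of calling a general dgeqrf.

    #include <stdlib.h>
    #include <string.h>
    #include <lapacke.h>

    /* Combine two n-by-n upper-triangular factors R0 and R1 (column-major,
       leading dimension n) into the n-by-n R factor of [R0; R1], as done at
       each node of the TSQR reduction tree.  Illustration only. */
    int tsqr_combine(int n, const double *R0, const double *R1, double *R)
    {
        int m = 2 * n;
        double *stack = malloc((size_t)m * n * sizeof(double));
        double *tau   = malloc((size_t)n * sizeof(double));
        if (!stack || !tau) { free(stack); free(tau); return -1; }

        /* Stack [R0; R1] column by column. */
        for (int j = 0; j < n; j++) {
            memcpy(stack + (size_t)j * m,     R0 + (size_t)j * n, n * sizeof(double));
            memcpy(stack + (size_t)j * m + n, R1 + (size_t)j * n, n * sizeof(double));
        }

        /* QR of the 2n-by-n stack; its upper triangle is the combined R. */
        int info = LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, stack, m, tau);

        /* Copy the upper triangle out (the rest of `stack` now holds the
           Householder vectors of this tree node). */
        for (int j = 0; j < n; j++)
            for (int i = 0; i <= j; i++)
                R[(size_t)j * n + i] = stack[(size_t)j * m + i];

        free(stack); free(tau);
        return info;
    }

Applying this combine pairwise up a binary tree over the processes yields the global R after log2(P) levels; Q, when needed, is applied implicitly from the Householder vectors stored at each level.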
24 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: seven blocks A_0..A_6, one per process; over time each yields its Q_i and the global R, so that A = QR.] Julien Langou University of Colorado Denver Hierarchical QR 24 of 103
25 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: blocks A_0..A_6 across processes and time, with computation and communication steps marked.] Julien Langou University of Colorado Denver Hierarchical QR 25 of 103
26 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: blocks A_0..A_6 across processes and time, with computation and communication steps marked.] Julien Langou University of Colorado Denver Hierarchical QR 26 of 103
27 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: blocks A_0..A_6 across processes and time, with computation and communication steps marked.] Julien Langou University of Colorado Denver Hierarchical QR 27 of 103
28 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Latency, but also the possibility of fast panel factorization. DGEQR3 is the recursive algorithm (see Elmroth and Gustavson, 2000); DGEQRF and DGEQR2 are the LAPACK routines. Times include QR and DLARFT. Run on a Pentium III. QR factorization and construction of T, m = 10,000, perf in MFLOP/sec (times in sec), columns n / DGEQR3 / DGEQRF / DGEQR2: (0.29) 65.0 (0.77) 64.6 (0.77); (0.83) 62.6 (3.17) 65.3 (3.04); (1.60) 81.6 (5.46) 64.2 (6.94); (2.53) (7.09) 65.9 (11.98). [Plot: MFLOP/sec, m = 1,000,000, the x axis is n.] Julien Langou University of Colorado Denver Hierarchical QR 28 of 103
29 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Blue Gene L frost.ncar.edu Q and R: Strong scalability In this experiment, we fix the problem: m=1,000,000 and n=50. Then we increase the number of processors. MFLOPs/sec/proc # of processors ReduceHH (QR3) ReduceHH (QRF) ScaLAPACK QRF ScaLAPACK QR2 Julien Langou University of Colorado Denver Hierarchical QR 29 of 103
30 Tall and Skinny matrices Minimizing communication in parallel distributed (29) communication :: TSQR :: parallel case. When only R is wanted: [Figure: binary tree over four processes; QR(A_i) -> R_i^(0), then QR([R_0^(0); R_1^(0)]) -> R_0^(1) and QR([R_2^(0); R_3^(0)]) -> R_2^(1), then QR([R_0^(1); R_2^(1)]) -> R.] Consider architecture: parallel case, P processing units. Problem: QR factorization of an m-by-n TS matrix (TS means m/P >= n). (Main) assumption: the operation is truly parallel distributed. Answer: binary tree. Theory (TSQR vs ScaLAPACK-like vs lower bound): # flops: 2mn^2/P + (2n^3/3) log P vs 2mn^2/P - 2n^3/(3P) vs Theta(mn^2/P); # words: (n^2/2) log P vs (n^2/2) log P vs (n^2/2) log P; # messages: log P vs 2n log P vs log P. Julien Langou University of Colorado Denver Hierarchical QR 30 of 103
31 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Any tree does. When only R is wanted: the MPI_Allreduce. In the case where only R is wanted, instead of constructing our own tree, one can simply use MPI_Allreduce with a user-defined operation. The operation we give to MPI is basically Algorithm 2: it performs the operation (R_1, R_2) -> R, where R is the R factor of QR([R_1; R_2]). This binary operation is associative, and this is all MPI needs to use a user-defined operation on a user-defined datatype. Moreover, if we change the signs of the elements of R so that the diagonal of R holds positive elements, then the binary operation Rfactor becomes commutative. The code becomes two lines: lapack_dgeqrf( mloc, n, A, lda, tau, &dlwork, lwork, &info ); MPI_Allreduce( MPI_IN_PLACE, A, 1, MPI_UPPER, LILA_MPIOP_QR_UPPER, mpi_comm); Julien Langou University of Colorado Denver Hierarchical QR 31 of 103
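For illustration, here is a minimal sketch of how such a user-defined reduction could be registered with MPI. The datatype MPI_UPPER and operation LILA_MPIOP_QR_UPPER on the slide are the authors' own packed-triangular versions; this sketch instead ships the full n-by-n block, reuses the hypothetical tsqr_combine helper from the earlier sketch, and uses a file-scope n only to keep the example short.

    #include <mpi.h>

    static int g_n;  /* number of columns, set before the reduction (illustration only) */

    /* User-defined MPI reduction: inout := R factor of QR([in; inout]). */
    static void qr_reduce_op(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        (void)dt;
        for (int i = 0; i < *len; i++) {
            const double *Rin = (const double *)in + (size_t)i * g_n * g_n;
            double *Rio       = (double *)inout    + (size_t)i * g_n * g_n;
            tsqr_combine(g_n, Rin, Rio, Rio);   /* reads Rio fully before writing it */
        }
    }

    /* Reduce the per-process R factors (n-by-n, column-major) across comm. */
    void allreduce_tsqr(double *Rlocal, int n, MPI_Comm comm)
    {
        MPI_Datatype r_type;
        MPI_Op       qr_op;

        g_n = n;
        MPI_Type_contiguous(n * n, MPI_DOUBLE, &r_type);
        MPI_Type_commit(&r_type);
        /* commute = 0: the combine is associative; it only becomes commutative
           with the positive-diagonal sign convention mentioned on the slide. */
        MPI_Op_create(qr_reduce_op, 0, &qr_op);

        MPI_Allreduce(MPI_IN_PLACE, Rlocal, 1, r_type, qr_op, comm);

        MPI_Op_free(&qr_op);
        MPI_Type_free(&r_type);
    }

With commute set to 0, MPI is free to pick any reduction tree while preserving operand order, which is exactly the "any tree does" observation of the slide.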
32 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Binary tree, flat tree, a weird tree, another weird tree. [Figure: parallelism versus time for each of the four reduction trees.] Julien Langou University of Colorado Denver Hierarchical QR 32 of 103
33 Tall and Skinny matrices Minimizing communication in hierarchichal parallel distributed (6) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 33 of 103
34 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) [Table: inter-site latency (ms) and throughput (Mb/s) between the sites Orsay, Toulouse, Bordeaux and Sophia.] Julien Langou University of Colorado Denver Hierarchical QR 34 of 103
35 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Cluster 1 Illustration of ScaLAPACK PDGEQR2 without reduce affinity. Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 Julien Langou University of Colorado Denver Hierarchical QR 35 of 103
36 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Cluster 1 Illustration of ScaLAPACK PDGEQR2 with reduce affinity. Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 Julien Langou University of Colorado Denver Hierarchical QR 36 of 103
37 Tall and Skinny matrices Minimizing communication in hierarchichal parallel distributed (6) Cluster 1 Illustration of TSQR without reduce affinity Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 Julien Langou University of Colorado Denver Hierarchical QR 37 of 103
38 Tall and Skinny matrices Minimizing communication in hierarchichal parallel distributed (6) Cluster 1 Illustration of TSQR with reduce affinity Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 Julien Langou University of Colorado Denver Hierarchical QR 38 of 103
39 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Using 4 clusters of 32 processors. [Figures: performance (GFlop/sec), experiments and model, of Classical Gram-Schmidt and of TSQR with a binary tree, for pg = 1, 2 or 4 clusters, pc = 32 nodes per cluster, n = 32, m varying along the x axis; 1 cluster reaches 9.27 GFlops/sec.] Two effects at once: (1) avoiding communication with TSQR, (2) tuning of the reduction tree. Julien Langou University of Colorado Denver Hierarchical QR 39 of 103
40 Tall and Skinny matrices Minimizing communication in sequential (1) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 40 of 103
41 Tall and Skinny matrices Minimizing communication in sequential (1) Consider architecture: sequential case, one processing unit with cache of size W. Problem: QR factorization of an m-by-n TS matrix (TS means m >= n, and W >= (3/2) n^2). Answer: flat tree. Theory (flat tree vs LAPACK-like vs lower bound): # flops: 2mn^2 vs 2mn^2 vs Theta(mn^2); # words: 2mn vs m^2 n^2 / (2W) vs 2mn; # messages: 3mn/W vs mn^2 / (2W) vs 2mn/W. Julien Langou University of Colorado Denver Hierarchical QR 41 of 103
42 Rectangular matrices Minimizing communication in sequential (2) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 42 of 103
43 Rectangular matrices Minimizing communication in sequential (2) 100 M fast = 10 6 M slow = 10 9 β = 10 8 γ = sequential case LU (or QR) factorization of square matrices 90 percent of CPU peak performance /2 n = M fast problem lower bound upper bound for CAQR flat tree lower bound for the LAPACK algorithm 1/2 n = M slow n (matrix size) Julien Langou University of Colorado Denver Hierarchical QR 43 of 103
44 Rectangular matrices Minimizing communication in sequential (2) sequential case LU (or QR) factorization of square matrices ( # of slow memory references ) / (problem lower bound) M fast = 10 3 lower bound for the LAPACK algorithm upper bound for CAQR flat tree M fast = 10 6 M fast = n size of the matrix Julien Langou University of Colorado Denver Hierarchical QR 44 of 103
45 Rectangular matrices Maximizing parallelism on multicore nodes (30) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 45 of 103
46 Rectangular matrices Maximizing parallelism on multicore nodes (30) Mission Perform the QR factorization of an initial tiled matrix A on a multicore platform. time Julien Langou University of Colorado Denver Hierarchical QR 46 of 103
47 Rectangular matrices Maximizing parallelism on multicore nodes (30) for (k = 0; k < TILES; k++) { dgeqrt(A[k][k], T[k][k]); for (n = k+1; n < TILES; n++) dlarfb(A[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++) { dtsqrt(A[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 47 of 103
48 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); V R T A dgeqrt 0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 48 of 103
49 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); V R T A dgeqrt 0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 49 of 103
50 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); R T V A dgeqrt 0 dlarb 0,1 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 50 of 103
51 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); R T V A dgeqrt 0 dlarb 0,1 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 51 of 103
52 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); R T V A dgeqrt 0 dlarb 0,1 dlarb 0,2 dlarb 0,3 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 52 of 103
53 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); R T V A dgeqrt 0 dlarb 0,1 dlarb 0,2 dlarb 0,3 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 53 of 103
54 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); R T R dgeqrt 0 V A dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 54 of 103
55 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); R T R dgeqrt 0 V A dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 55 of 103
56 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A dgeqrt 0 B V B dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } dssrb 1,1 Julien Langou University of Colorado Denver Hierarchical QR 56 of 103
57 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A dgeqrt 0 B V B dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } dssrb 1,1 Julien Langou University of Colorado Denver Hierarchical QR 57 of 103
58 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlarfb(A[0][0], T[0][0], A[0][3]); 5. dtsqrt(A[0][0], A[1][0], T[1][0]); 6. dssrfb(A[1][0], T[1][0], A[0][1], A[1][1]); 7. dssrfb(A[1][0], T[1][0], A[0][2], A[1][2]); 8. dssrfb(A[1][0], T[1][0], A[0][3], A[1][3]); [Figure: dgeqrt 0, dlarfb 0,1, dlarfb 0,2, dlarfb 0,3, dtsqrt 1,0, dssrfb 1,1, dssrfb 1,2, dssrfb 1,3.] for (k = 0; k < TILES; k++) { dgeqrt(A[k][k], T[k][k]); for (n = k+1; n < TILES; n++) dlarfb(A[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++) { dtsqrt(A[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 58 of 103
59 Rectangular matrices Maximizing parallelism on multicore nodes (30) DGEQRT DLARFB DLARFB DTSQRT DLARFB DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DLARFB DLARFB DTSQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DGEQRT DSSRFB DSSRFB DSSRFB DTSQRT DLARFB DTSQRT DLARFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DTSQRT DGEQRT DSSRFB DSSRFB DLARFB DTSQRT DSSRFB DGEQRT DAG for 5x5 matrix (QR factorization) Julien Langou University of Colorado Denver Hierarchical QR 59 of 103
60 Rectangular matrices Maximizing parallelism on multicore nodes (30) [Fig. 9: DAG for a 20x20 tile matrix (QR factorization). Fig. 10: DAG for an LU factorization with 20 tiles (block size 200 and matrix size 4000). The size of the DAG grows very fast with the number of tiles.] ... more tasks until some are completed. The usage of a window of tasks has implications in how the loops of an application are unfolded and how much look-ahead is possible. Julien Langou University of Colorado Denver Hierarchical QR 60 of 103
61 Rectangular matrices Maximizing parallelism on multicore nodes (30) Tile QR factorization on a 3.2 GHz CELL processor. [Figure 5: Performance of the tile QR factorization in single precision on a 3.2 GHz CELL processor with eight SPEs. Square matrices were used. The solid horizontal line marks the performance of the SSSRFB kernel times the number of SPEs (22.16 x 8 = 177 Gflop/s).] The presented implementation of tile QR factorization on the CELL processor allows for factorization of a 4000 by 4000 dense matrix in single precision in exactly half of a second. To the authors' knowledge, at present, it is the fastest reported time of solving such a problem by any semiconductor device implemented on a single semiconductor die. Jakub Kurzak and Jack Dongarra, LAWN 201, QR Factorization for the CELL Processor, May 2008. Julien Langou University of Colorado Denver Hierarchical QR 61 of 103
62 Rectangular matrices Maximizing parallelism on multicore nodes (30) Perform the QR factorization of an initial tiled matrix A. time. Our tool: Givens rotations: [ cos θ, sin θ ; -sin θ, cos θ ]. Julien Langou University of Colorado Denver Hierarchical QR 62 of 103
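At the scalar level the elementary tool looks as follows; in the tiled algorithms each such rotation is replaced by a tile kernel (TSQRT or TTQRT), but the elimination pattern is the same. A minimal, textbook sketch (the function name is ours):

    #include <math.h>

    /* Compute c, s such that [ c  s; -s  c ] * [a; b] = [r; 0]:
       the Givens rotation that zeroes the entry b against a. */
    void givens(double a, double b, double *c, double *s, double *r)
    {
        if (b == 0.0) {              /* nothing to eliminate */
            *c = 1.0; *s = 0.0; *r = a;
        } else {
            double h = hypot(a, b);  /* sqrt(a^2 + b^2), overflow-safe */
            *c = a / h;
            *s = b / h;
            *r = h;
        }
    }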
63 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1, binomial tree: 4 steps. First column: the goal is to introduce zeros using Givens rotations; only the top entry shall stay. At each step, one entry eliminates another (this is a reduction). time. The first step requires seven computing units, the second four, the third two, and the fourth one. Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
64 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 binomial tree 4 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
65 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
66 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 8 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
67 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 8 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
68 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
69 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
70 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
71 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
72 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
73 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 time Note: one point of the plasma tree is to (1) enhance parallelism in the rectangular case, (2) increase data locality (when working within a domain). Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
74 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
75 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 8 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
76 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
77 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree 4 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
78 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree 4 6 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
79 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
80 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
81 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Flat tree: Sameh and Kuck (1978). Plasma tree: Hadri et al. (2010). Note that the binomial tree corresponds to bs = 1, and the flat tree to bs = p. Greedy: Modi and Clarke (1984), Cosnard and Robert (1986). Cosnard and Robert (1986) proved that no matter the shape of the matrix (i.e., p and q), Greedy was optimal. Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
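For a single column (q = 1) the coarse step counts quoted in the figures above can be reproduced directly; a small sketch counting coarse-grain elimination steps only (kernel weights ignored), with function names of our choosing:

    #include <math.h>

    /* Steps to reduce one column of p tiles to a single tile. */
    int flat_tree_steps(int p)     { return p - 1; }                      /* p = 15 -> 14 */
    int binomial_tree_steps(int p) { return (int)ceil(log2((double)p)); } /* p = 15 -> 4  */

    /* PLASMA-style tree: flat tree inside each domain of bs tiles,
       then a binomial tree across the domain heads. */
    int plasma_tree_steps(int p, int bs)
    {
        int domains = (p + bs - 1) / bs;
        return (bs - 1) + (int)ceil(log2((double)domains)); /* p = 15, bs = 4 -> 5 */
    }

For q = 1 the greedy tree gives the same count as the binomial tree (4 for p = 15 in the figures); the differences between the trees only show up for q > 1, where the per-column reductions have to be pipelined.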
82 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 64 of 103
83 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 65 of 103
84 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 66 of 103
85 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 67 of 103
86 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations.
Operation / Panel (name, cost) / Update (name, cost):
Factor square into triangle: GEQRT, 4 / UNMQR, 6
Zero square with triangle on top: TSQRT, 6 / TSMQR, 12
Zero triangle with triangle on top: TTQRT, 2 / TTMQR, 6
Algorithm 1: Elimination elim(i, piv(i, k), k) via TS kernels (triangle on top of square):
GEQRT(piv(i, k), k); for j = k + 1 to q do UNMQR(piv(i, k), k, j); TSQRT(i, piv(i, k), k); for j = k + 1 to q do TSMQR(i, piv(i, k), k, j).
Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels (triangle on top of triangle):
GEQRT(piv(i, k), k); GEQRT(i, k); for j = k + 1 to q do UNMQR(piv(i, k), k, j), UNMQR(i, k, j); TTQRT(i, piv(i, k), k); for j = k + 1 to q do TTMQR(i, piv(i, k), k, j).
Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
87 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
88 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
89 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
90 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
91 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square 0 Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
92 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
93 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
94 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
95 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
96 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
97 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
98 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 0 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
99 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Note: it is understood that if a tile is already in triangle form, then the associated GEQRT and update kernels don t need to be applied. Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
100 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
101 Rectangular matrices Maximizing parallelism on multicore nodes (30) The analysis and coding is a little more complex... time Julien Langou University of Colorado Denver Hierarchical QR 69 of 103
102 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: No matter what elimination list (any combination of TT, TS) is used, the total weight of the tasks for performing a tiled QR factorization algorithm is constant and equal to 6pq^2 - 2q^3. Proof:
L1 :: GEQRT = TTQRT + q
L2 :: UNMQR = TTMQR + (1/2) q (q - 1)
L3 :: TTQRT + TSQRT = pq - (1/2) q (q + 1)
L4 :: TTMQR + TSMQR = (1/2) pq (q - 1) - (1/6) q (q - 1) (q + 1)
Define L5 as 4 L1 + 6 L2 + 6 L3 + 12 L4 and we get
L5 :: 6pq^2 - 2q^3 = 4 GEQRT + 12 TSMQR + 6 TTMQR + 2 TTQRT + 6 TSQRT + 6 UNMQR = total # of flops.
Note: Using our unit task weight of b^3/3, with m = pb and n = qb, we obtain 2mn^2 - (2/3) n^3 flops, which is the exact same number as for a standard Householder reflection algorithm as found in LAPACK or ScaLAPACK. Julien Langou University of Colorado Denver Hierarchical QR 70 of 103
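A quick check of the note above, substituting m = pb, n = qb, and the unit task weight b^3/3:

\[
\left(6pq^{2} - 2q^{3}\right)\frac{b^{3}}{3}
  \;=\; 2\,(pb)(qb)^{2} \;-\; \tfrac{2}{3}(qb)^{3}
  \;=\; 2mn^{2} - \tfrac{2}{3}n^{3}.
\]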
103 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: For a tiled matrix of size p x q, where p >= q, the critical path length of FLATTREE is: 2p + 2 if p >= q = 1; 6p + 16q - 22 if p > q > 1; 22p - 24 if p = q > 1. [Figure: pipelined progression of the 2nd, 3rd, and 4th rows and of the 2nd and 3rd columns over time.] Initial: 10; fill the pipeline: 6(p - 1); pipeline: 16(q - 2); end: 4. Julien Langou University of Colorado Denver Hierarchical QR 71 of 103
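Reading the three cases of the theorem above directly into a helper (a sketch; formulas as reconstructed from the slide, for p >= q):

    /* Critical path length of FLATTREE for a p-by-q tile matrix, p >= q. */
    int flattree_critical_path(int p, int q)
    {
        if (q == 1)      return 2 * p + 2;
        else if (p == q) return 22 * p - 24;          /* p = q > 1 */
        else             return 6 * p + 16 * q - 22;  /* p > q > 1 */
    }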
104 Rectangular matrices Maximizing parallelism on multicore nodes (30) Greedy tree on a 15x4 matrix. (Weighted on top, coarse below.) time time Julien Langou University of Colorado Denver Hierarchical QR 72 of 103
105 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: For a tiled matrix of size p x q, where p >= q, given the time of any coarse-grain algorithm, COARSE(p, q), the corresponding weighted algorithm time WEIGHTED(p, q) satisfies 10 (q - 1) + 6 COARSE(p, q - 1) <= WEIGHTED(p, q) <= 10 (q - 1) + 6 COARSE(p, q - 1) + 4 + 2 (COARSE(p, q) - COARSE(p, q - 1)). Julien Langou University of Colorado Denver Hierarchical QR 73 of 103
106 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem We can prove that the GRASAP algorithm is optimal. Julien Langou University of Colorado Denver Hierarchical QR 74 of 103
107 Rectangular matrices Maximizing parallelism on multicore nodes (30) Platform: All experiments were performed on a 48-core machine composed of eight hexa-core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz. Each core has a theoretical peak of 11.2 Gflop/s, with a peak of 537.6 Gflop/s for the whole machine. Experimental code is written using the QUARK scheduler. Results are tested with ||I - Q^T Q|| and ||A - QR|| / ||A||. Code has been written in real and complex arithmetic, single and double precision. Julien Langou University of Colorado Denver Hierarchical QR 75 of 103
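The first accuracy test can be computed with two standard BLAS/LAPACK calls. A minimal sketch (not the authors' test harness), assuming Q is m-by-n, column-major, with leading dimension m:

    #include <stdlib.h>
    #include <cblas.h>
    #include <lapacke.h>

    /* Frobenius norm of I - Q^T Q for an m-by-n Q (column-major, lda = m). */
    double orthogonality_error(int m, int n, const double *Q)
    {
        double *C = calloc((size_t)n * n, sizeof(double));
        for (int i = 0; i < n; i++) C[(size_t)i * n + i] = 1.0;   /* C = I */

        /* C := -1 * Q^T * Q + 1 * C, upper triangle only (C is symmetric). */
        cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans,
                    n, m, -1.0, Q, m, 1.0, C, n);

        double err = LAPACKE_dlansy(LAPACK_COL_MAJOR, 'F', 'U', n, C, n);
        free(C);
        return err;
    }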
108 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200, Model) 5 Best domain size for PLASMATREE (TT) = [ ] 4.5 FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY Overhead in cp length with respect to GREEDY (GREEDY = 1) q Julien Langou University of Colorado Denver Hierarchical QR 76 of 103
109 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200, Experimental) 5 Best domain size for PLASMATREE (TT) = [ ] 4.5 FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY Overhead in time with respect to GREEDY (GREEDY = 1) q Julien Langou University of Colorado Denver Hierarchical QR 77 of 103
110 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200, Model on left, Experimental on right) Best domain size for PLASMATREE (TT) = [ ] Best domain size for PLASMATREE (TT) = [ ] FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY 4.5 FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY Overhead in cp length with respect to GREEDY (GREEDY = 1) Overhead in time with respect to GREEDY (GREEDY = 1) q Main difference is on the right where there is enough parallelism so all method give peak performance. (peak is TTMQRT performance) q Julien Langou University of Colorado Denver Hierarchical QR 78 of 103
111 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200) 160 Best domain size for PLASMATREE (TT) = [ ] GFLOP/s FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY q Julien Langou University of Colorado Denver Hierarchical QR 79 of 103
112 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200, Model on left, Experimental on right) Best domain size for PLASMATREE (TT) = [ ] Best domain size for PLASMATREE (TT) = [ ] Predicted GFLOP/s GFLOP/s FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY 20 FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY q q Julien Langou University of Colorado Denver Hierarchical QR 80 of 103
113 Rectangular matrices Parallelism+communication on multicore nodes (2) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 81 of 103
114 Rectangular matrices Parallelism+communication on multicore nodes (2) Factor kernels (code name, # flops): GEQRT 4, TTQRT 2, TSQRT 6. We work on square b-by-b tiles. The unit for the task weights is b^3/3. Each of our tasks is O(b^3) computation for O(b^2) communication. Thanks to this surface/volume effect, we can, in a first approximation, neglect communication and focus only on the parallelism. Julien Langou University of Colorado Denver Hierarchical QR 82 of 103
115 Rectangular matrices Parallelism+communication on multicore nodes (2) Using TT kernels (GEQRT, GEQRT, TTQRT): total # of flops (sequential time): 2 * GEQRT + TTQRT = 2 * 4 + 2 = 10; critical path length (parallel time): GEQRT + TTQRT = 4 + 2 = 6. Using TS kernels (GEQRT, TSQRT): total # of flops (sequential time): GEQRT + TSQRT = 4 + 6 = 10; critical path length (parallel time): GEQRT + TSQRT = 4 + 6 = 10. One remarkable outcome is that the total number of flops is 10 b^3/3. If we consider the number of flops for a standard LAPACK algorithm (based on standard Householder transformations), we would also obtain 10 b^3/3. Julien Langou University of Colorado Denver Hierarchical QR 83 of 103
116 Rectangular matrices Parallelism+communication on multicore nodes (2) [Figure: kernel performance, GFLOP/s versus tile size, for TSQRT, GEQRT, TTQRT, GEQRT+TTQRT and GEMM, (in) and (out).] The TS kernels (factorization and update) are more efficient than the TT ones due to a variety of reasons (better vectorization, better volume-to-surface ratio). Julien Langou University of Colorado Denver Hierarchical QR 84 of 103
117 Rectangular matrices Parallelism+communication on multicore nodes (2) Real Arithmetic (p=40, b=200, Model on left, Experimental on right). [Figures: predicted and measured GFLOP/s versus q for FLATTREE (TS), PLASMATREE (TS) (best), FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT) and GREEDY.] Motivates the introduction of a TS level: square 1-by-1 tiles are grouped into larger rectangular tiles of size a-by-1, which are eliminated in a TS fashion. The regular TT algorithm is then applied on the rectangular tiles. Pros: recover the performance of the TS level. Cons: introduce a tuning parameter (a). Julien Langou University of Colorado Denver Hierarchical QR 85 of 103
118 Rectangular matrices Parallelism+communication on multicore nodes (2) Real Arithmetic (p=40,b=200, performance on left, overhead on right) 250 P=40 MB=200,IB=32,cores = 48 2 P=40 MB=200,IB=32,cores = 48 Gflop/sec overhead wrt BEST BEST FT (TS) BINOMIAL (TT) FT (TT) FIBO (TT) Greedy (TT) Greedy+RRon+a=8 Greedy+RRoff+a=8 50 BEST FT (TS) BINOMIAL (TT) FT (TT) FIBO (TT) Greedy (TT) Greedy+RRon+a=8 Greedy+RRoff+a= Q Q Idea: use TS domain of size a, use greedy on top of them, use round-robin on the domains (This experiment was done with DAGuE.) Julien Langou University of Colorado Denver Hierarchical QR 86 of 103
119 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 87 of 103
120 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Outline Three goals: 1. minimize the number of local communications 2. minimize the number of parallel distributed communications 3. maximize the parallelism within a node Julien Langou University of Colorado Denver Hierarchical QR 88 of 103
121 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103
122 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103
123 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103
124 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical algorithm layout Legend P 0 P 1 P 2 0 local TS 1 local TT 2 domino 3 global tree Julien Langou University of Colorado Denver Hierarchical QR 90 of 103
More informationTiled QR factorization algorithms
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE Tiled QR factorization algorithms Henricus Bouwmeester Mathias Jacuelin Julien Langou Yves Robert N 7601 Avril 2011 Distributed and High
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More informationTiled QR factorization algorithms
Tiled QR factorization algorithms Henricus Bouwmeester, Mathias Jacuelin, Julien Langou, Yves Robert To cite this version: Henricus Bouwmeester, Mathias Jacuelin, Julien Langou, Yves Robert. Tiled QR factorization
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationMinimizing Communication in Linear Algebra. James Demmel 15 June
Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on
More informationMatrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing)
Matrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing) alfredo.buttari@enseeiht.fr for an up-to-date version of the slides: http://buttari.perso.enseeiht.fr Introduction Objective
More informationLAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou Dpt of Electrical Engineering
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationA communication-avoiding thick-restart Lanczos method on a distributed-memory system
A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June
More informationAvoiding Communication in Distributed-Memory Tridiagonalization
Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura
More informationMARCH 24-27, 2014 SAN JOSE, CA
MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =
More informationRe-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009
III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009 Background and motivation Running time of an algorithm
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois
More informationOverview of Dense Numerical Linear Algebra Libraries
Overview of Dense Numerical Linear Algebra Libraries BLAS: kernel for dense linear algebra LAPACK: sequen;al dense linear algebra ScaLAPACK: parallel distributed dense linear algebra And Beyond Linear
More informationA distributed packed storage for large dense parallel in-core calculations
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A
More informationIntroduction to communication avoiding linear algebra algorithms in high performance computing
Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 5 Eigenvalue Problems Section 5.1 Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationPerformance optimization of WEST and Qbox on Intel Knights Landing
Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationEnhancing Performance of Tall-Skinny QR Factorization using FPGAs
Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Abid Rafique Imperial College London August 31, 212 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18 Our Claim Common
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More information2.5D algorithms for distributed-memory computing
ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou To cite this version: Emmanuel Agullo, Camille Coti,
More informationTable 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )
ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS Abid Rafique, Nachiket Kapre and George A. Constantinides Electrical and Electronic Engineering Department Imperial College London London,
More informationJacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More informationPerformance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank
More informationParallelizing Gaussian Process Calcula1ons in R
Parallelizing Gaussian Process Calcula1ons in R Christopher Paciorek UC Berkeley Sta1s1cs Joint work with: Benjamin Lipshitz Wei Zhuo Prabhat Cari Kaufman Rollin Thomas UC Berkeley EECS (formerly) IBM
More informationPorting a Sphere Optimization Program from lapack to scalapack
Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential
More informationPower Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
Noname manuscript No. (will be inserted by the editor) Power Profiling of Cholesky and QR Factorizations on Distributed s George Bosilca Hatem Ltaief Jack Dongarra Received: date / Accepted: date Abstract
More informationMulticore Parallelization of Determinant Quantum Monte Carlo Simulations
Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March
More informationA short note on the Householder QR factorization
A short note on the Householder QR factorization Alfredo Buttari October 25, 2016 1 Uses of the QR factorization This document focuses on the QR factorization of a dense matrix. This method decomposes
More informationImprovements for Implicit Linear Equation Solvers
Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationab initio Electronic Structure Calculations
ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More informationA parallel tiled solver for dense symmetric indefinite systems on multicore architectures
A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris
More informationStatic-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems
Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse
More informationThe Future of LAPACK and ScaLAPACK
The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationPRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM
Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value
More informationCommunication-avoiding Krylov subspace methods
Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current
More informationPorting a sphere optimization program from LAPACK to ScaLAPACK
Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference
More informationParallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179
Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179 Marc Baboulin Luc Giraud Serge Gratton Julien Langou September 19, 2006 Abstract
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest
More informationScalable numerical algorithms for electronic structure calculations
Scalable numerical algorithms for electronic structure calculations Edgar Solomonik C Berkeley July, 2012 Edgar Solomonik Cyclops Tensor Framework 1/ 73 Outline Introduction Motivation: Coupled Cluster
More informationPerformance Evaluation of Some Inverse Iteration Algorithms on PowerXCell T M 8i Processor
Performance Evaluation of Some Inverse Iteration Algorithms on PowerXCell T M 8i Processor Masami Takata 1, Hiroyuki Ishigami 2, Kini Kimura 2, and Yoshimasa Nakamura 2 1 Academic Group of Information
More information1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria
1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation
More informationA Parallel Implementation of the Davidson Method for Generalized Eigenproblems
A Parallel Implementation of the Davidson Method for Generalized Eigenproblems Eloy ROMERO 1 and Jose E. ROMAN Instituto ITACA, Universidad Politécnica de Valencia, Spain Abstract. We present a parallel
More informationReducing the Amount of Pivoting in Symmetric Indefinite Systems
Reducing the Amount of Pivoting in Symmetric Indefinite Systems Dulceneia Becker 1, Marc Baboulin 4, and Jack Dongarra 1,2,3 1 University of Tennessee, USA [dbecker7,dongarra]@eecs.utk.edu 2 Oak Ridge
More informationScalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti
Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti
More informationTheoretical Computer Science
Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized
More informationLU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version
1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2
More informationCS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform
CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast
More informationFinite-choice algorithm optimization in Conjugate Gradients
Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the
More informationDesign of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization
Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan 2, Robert A. van de Geijn 2, and Field
More informationParallel Numerical Linear Algebra
Parallel Numerical Linear Algebra 1 QR Factorization solving overconstrained linear systems tiled QR factorization using the PLASMA software library 2 Conjugate Gradient and Krylov Subspace Methods linear
More informationAvoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University
Avoiding communication in linear algebra Laura Grigori INRIA Saclay Paris Sud University Motivation Algorithms spend their time in doing useful computations (flops) or in moving data between different
More informationAccelerating computation of eigenvectors in the nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationMagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs
MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,
More informationSome notes on efficient computing and setting up high performance computing environments
Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient
More informationAccelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationIterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay
Iterative methods, preconditioning, and their application to CMB data analysis Laura Grigori INRIA Saclay Plan Motivation Communication avoiding for numerical linear algebra Novel algorithms that minimize
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationTOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM
TOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM AZZAM HAIDAR, HATEM LTAIEF, AND JACK DONGARRA Abstract. Classical solvers for the dense symmetric eigenvalue
More informationParalleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures
Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures Mark Gove? NOAA Earth System Research Laboratory We Need Be?er Numerical Weather Predic(on Superstorm Sandy Hurricane
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationPreconditioned Eigensolver LOBPCG in hypre and PETSc
Preconditioned Eigensolver LOBPCG in hypre and PETSc Ilya Lashuk, Merico Argentati, Evgueni Ovtchinnikov, and Andrew Knyazev Department of Mathematics, University of Colorado at Denver, P.O. Box 173364,
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationA hybrid Hermitian general eigenvalue solver
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,
More informationPracticality of Large Scale Fast Matrix Multiplication
Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by
More informationCyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions
Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik 1, Devin Matthews 3, Jeff Hammond 4, James Demmel 1,2 1 Department of
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J. SCI. COMPUT. Vol. 37, No. 3, pp. C307 C330 c 2015 Society for Industrial and Applied Mathematics MIXED-PRECISION CHOLESKY QR FACTORIZATION AND ITS CASE STUDIES ON MULTICORE CPU WITH MULTIPLE GPUS
More informationEnhancing Scalability of Sparse Direct Methods
Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.
More informationScalable Non-blocking Preconditioned Conjugate Gradient Methods
Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William
More informationLAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006.
LAPACK-Style Codes for Pivoted Cholesky and QR Updating Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig 2007 MIMS EPrint: 2006.385 Manchester Institute for Mathematical Sciences School of Mathematics
More informationOn aggressive early deflation in parallel variants of the QR algorithm
On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden
More informationMUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux
The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent
More informationINITIAL INTEGRATION AND EVALUATION
INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More information