Hierarchical QR factorization algorithms for multi-core cluster systems
1 Hierarchical QR factorization algorithms for multi-core cluster systems. Jack Dongarra, Mathieu Faverge, Thomas Herault, Julien Langou, Yves Robert. University of Tennessee Knoxville, USA; University of Colorado Denver, USA; Ecole Normale Supérieure de Lyon, France. June 29, 2012
2 reducing communication in sequential in parallel distributed increasing parallelism (or reducing the critical path, reducing synchronization) Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 2 of 103
3 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 3 of 103
4 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Reduce Algorithms: Introduction. The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications. Input: A is block distributed by rows. Output: Q is block distributed by rows, R is global. [Figure: blocks A_1, A_2, A_3 map to Q_1, Q_2, Q_3 and a shared R.] Julien Langou University of Colorado Denver Hierarchical QR 4 of 103
5 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Example of applications: in block iterative methods. a) in iterative methods with multiple right-hand sides (block iterative methods): 1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk), 2) BlockGMRES, BlockGCR, BlockCG, BlockQMR, ... b) in iterative methods with a single right-hand side: 1) s-step methods for linear systems of equations (e.g. A. Chronopoulos), 2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc, 3) recent work from M. Hoemmen and J. Demmel (U. California at Berkeley). c) in iterative eigenvalue solvers: 1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC), 2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX, 3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk), 4) PRIMME (A. Stathopoulos, Coll. William & Mary), 5) and also TRLAN, BLZPACK, IRBLEIGS. Julien Langou University of Colorado Denver Hierarchical QR 5 of 103
6 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Reduce Algorithms: Introduction. Example of applications: a) in linear least squares problems in which the number of equations is much larger than the number of unknowns, b) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers), c) in dense, large, and more square QR factorizations, where they are used as the panel factorization step. Julien Langou University of Colorado Denver Hierarchical QR 6 of 103
7 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Blocked LU and QR algorithms (LAPACK). LAPACK block LU (right looking): dgetrf. LAPACK block QR (right looking): dgeqrf. Panel factorization: dgetf2 (lu of the panel); dgeqr2 + dlarft (qr of the panel). Update of the remaining submatrix: dtrsm (+ dswp) and dgemm; dlarfb. [Figure: panel/update decomposition of A^(1) into L, U (resp. V, R), producing A^(2).] Julien Langou University of Colorado Denver Hierarchical QR 7 of 103
8 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Blocked LU and QR algorithms (LAPACK). LAPACK block LU (right looking): dgetrf. Panel factorization (dgetf2, lu of the panel): latency bounded, more than nb AllReduces for n*nb^2 ops. Update of the remaining submatrix (dtrsm (+ dswp), dgemm): CPU bandwidth bounded, the bulk of the computation (n*n*nb ops), highly parallelizable, efficient and scalable. Julien Langou University of Colorado Denver Hierarchical QR 8 of 103
9 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Parallelization of LU and QR. Parallelize the update: easy, and done in any reasonable software. This is the 2/3 n^3 term in the FLOPs count. Can be done efficiently with LAPACK + multithreaded BLAS. [Figure: dgemm update; dgetf2 lu( ); dtrsm (+ dswp), dgemm.] Julien Langou University of Colorado Denver Hierarchical QR 9 of 103
10 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Parallelization of LU and QR. Parallelize the update: easy, and done in any reasonable software. This is the 2/3 n^3 term in the FLOPs count. Can be done efficiently with LAPACK + multithreaded BLAS. Parallelize the panel factorization: not an option in a multicore context (p < 16); see e.g. ScaLAPACK or HPL, but the panel is still by far the slowest step and the bottleneck of the computation. Hide the panel factorization: lookahead (see e.g. High Performance LINPACK), dynamic scheduling. Julien Langou University of Colorado Denver Hierarchical QR 10 of 103
11 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Hiding the panel factorization with dynamic scheduling. [Execution trace over time.] Courtesy of Alfredo Buttari, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 11 of 103
12 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? Julien Langou University of Colorado Denver Hierarchical QR 12 of 103
13 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 13 of 103
14 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. We cannot hide the panel factorization in the MM; actually, it is the MMs that are hidden by the panel factorizations! Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 14 of 103
15 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. We cannot hide the panel factorization (n^2) with the MM (n^3); actually, it is the MMs that are hidden by the panel factorizations! NEED FOR NEW MATHEMATICAL ALGORITHMS. Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 15 of 103
16 Tall and Skinny matrices Minimizing communication in parallel distributed (29) A new generation of algorithms? Algorithms follow hardware evolution along time. LINPACK (80s) (vector operations): relies on Level 1 BLAS operations. LAPACK (90s) (blocking, cache friendly): relies on Level 3 BLAS operations. Julien Langou University of Colorado Denver Hierarchical QR 16 of 103
17 Tall and Skinny matrices Minimizing communication in parallel distributed (29) A new generation of algorithms? Algorithms follow hardware evolution along time. LINPACK (80s) (vector operations): relies on Level 1 BLAS operations. LAPACK (90s) (blocking, cache friendly): relies on Level 3 BLAS operations. New algorithms (00s) (multicore friendly): rely on a DAG/scheduler, a block data layout, and some extra kernels. Those new algorithms have a very low granularity and scale very well (multicore, petascale computing, ...), remove a lot of dependencies among the tasks (multicore, distributed computing), avoid latency (distributed computing, out of core), and rely on fast kernels. Those new algorithms need new kernels and rely on efficient scheduling algorithms. Julien Langou University of Colorado Denver Hierarchical QR 17 of 103
18 Tall and Skinny matrices Minimizing communication in parallel distributed (29) New algorithms based on 2D partitioning: UTexas (van de Geijn): SYRK, CHOL (multicore), LU, QR (out of core); UTennessee (Dongarra): CHOL (multicore); HPC2N (Kågström) / IBM (Gustavson): CHOL (distributed); UC Berkeley (Demmel) / INRIA (Grigori): LU/QR (distributed); UC Denver (Langou): LU/QR (distributed). A 3rd revolution for dense linear algebra? Julien Langou University of Colorado Denver Hierarchical QR 18 of 103
19 Tall and Skinny matrices Minimizing communication in parallel distributed (29) reducing communication in sequential in parallel distributed increasing parallelism (or reducing the critical path, reducing synchronization) We start with reducing communication in parallel distributed in the tall and skinny case. Julien Langou University of Colorado Denver Hierarchical QR 19 of 103
20 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. [Figure: processes versus time; A_0 is factored into Q_0 on process 0, A_1 into Q_1 on process 1.] Julien Langou University of Colorado Denver Hierarchical QR 20 of 103
21 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 1: each process factors its local block, QR(A_0) -> (V_0^(0), R_0^(0)) and QR(A_1) -> (V_1^(0), R_1^(0)). Julien Langou University of Colorado Denver Hierarchical QR 21 of 103
22 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 2: the two triangular factors are gathered and stacked into [R_0^(0); R_1^(0)]. Julien Langou University of Colorado Denver Hierarchical QR 22 of 103
23 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 3: the stack is factored, QR([R_0^(0); R_1^(0)]) -> ([V_0^(1); V_1^(1)], R^(1)), which produces the R factor of the whole matrix. Julien Langou University of Colorado Denver Hierarchical QR 23 of 103
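The combine step above (stack two triangular factors and take the QR of the stack) is the only new building block. Below is a minimal sketch of that step, assuming column-major storage and LAPACKE; the helper name tsqr_combine is hypothetical, and a production TSQR kernel would exploit the triangular structure of the stack instead of calling a general dgeqrf.

    #include <stdlib.h>
    #include <string.h>
    #include <lapacke.h>

    /* Combine two n-by-n upper-triangular factors R0 and R1 (column-major,
       leading dimension n) into the n-by-n R factor of [R0; R1], as done at
       each node of the TSQR reduction tree.  Illustration only. */
    int tsqr_combine(int n, const double *R0, const double *R1, double *R)
    {
        int m = 2 * n;
        double *stack = malloc((size_t)m * n * sizeof(double));
        double *tau   = malloc((size_t)n * sizeof(double));
        if (!stack || !tau) { free(stack); free(tau); return -1; }

        /* Stack [R0; R1] column by column. */
        for (int j = 0; j < n; j++) {
            memcpy(stack + (size_t)j * m,     R0 + (size_t)j * n, n * sizeof(double));
            memcpy(stack + (size_t)j * m + n, R1 + (size_t)j * n, n * sizeof(double));
        }

        /* QR of the 2n-by-n stack; its upper triangle is the combined R. */
        int info = LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, stack, m, tau);

        /* Copy the upper triangle out (the rest of `stack` now holds the
           Householder vectors of this tree node). */
        for (int j = 0; j < n; j++)
            for (int i = 0; i <= j; i++)
                R[(size_t)j * n + i] = stack[(size_t)j * m + i];

        free(stack); free(tau);
        return info;
    }

Applying this combine pairwise up a binary tree over the processes yields the global R after log2(P) levels; Q, when needed, is applied implicitly from the Householder vectors stored at each level.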
24 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: seven blocks A_0..A_6, one per process; over time each yields its Q_i and the global R, so that A = QR.] Julien Langou University of Colorado Denver Hierarchical QR 24 of 103
25 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: blocks A_0..A_6 across processes and time, with computation and communication steps marked.] Julien Langou University of Colorado Denver Hierarchical QR 25 of 103
26 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: blocks A_0..A_6 across processes and time, with computation and communication steps marked.] Julien Langou University of Colorado Denver Hierarchical QR 26 of 103
27 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: blocks A_0..A_6 across processes and time, with computation and communication steps marked.] Julien Langou University of Colorado Denver Hierarchical QR 27 of 103
28 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Latency, but also the possibility of fast panel factorization. DGEQR3 is the recursive algorithm (see Elmroth and Gustavson, 2000); DGEQRF and DGEQR2 are the LAPACK routines. Times include QR and DLARFT. Run on a Pentium III. QR factorization and construction of T, m = 10,000, perf in MFLOP/sec (times in sec), columns n / DGEQR3 / DGEQRF / DGEQR2: (0.29) 65.0 (0.77) 64.6 (0.77); (0.83) 62.6 (3.17) 65.3 (3.04); (1.60) 81.6 (5.46) 64.2 (6.94); (2.53) (7.09) 65.9 (11.98). [Plot: MFLOP/sec, m = 1,000,000, the x axis is n.] Julien Langou University of Colorado Denver Hierarchical QR 28 of 103
29 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Blue Gene L frost.ncar.edu Q and R: Strong scalability In this experiment, we fix the problem: m=1,000,000 and n=50. Then we increase the number of processors. MFLOPs/sec/proc # of processors ReduceHH (QR3) ReduceHH (QRF) ScaLAPACK QRF ScaLAPACK QR2 Julien Langou University of Colorado Denver Hierarchical QR 29 of 103
30 Tall and Skinny matrices Minimizing communication in parallel distributed (29) communication :: TSQR :: parallel case. When only R is wanted: [Figure: binary tree over four processes; QR(A_i) -> R_i^(0), then QR([R_0^(0); R_1^(0)]) -> R_0^(1) and QR([R_2^(0); R_3^(0)]) -> R_2^(1), then QR([R_0^(1); R_2^(1)]) -> R.] Consider architecture: parallel case, P processing units. Problem: QR factorization of an m-by-n TS matrix (TS means m/P >= n). (Main) assumption: the operation is truly parallel distributed. Answer: binary tree. Theory (TSQR vs ScaLAPACK-like vs lower bound): # flops: 2mn^2/P + (2n^3/3) log P vs 2mn^2/P - 2n^3/(3P) vs Theta(mn^2/P); # words: (n^2/2) log P vs (n^2/2) log P vs (n^2/2) log P; # messages: log P vs 2n log P vs log P. Julien Langou University of Colorado Denver Hierarchical QR 30 of 103
31 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Any tree does. When only R is wanted: the MPI_Allreduce. In the case where only R is wanted, instead of constructing our own tree, one can simply use MPI_Allreduce with a user-defined operation. The operation we give to MPI is basically Algorithm 2: it performs the operation (R_1, R_2) -> R, where R is the R factor of QR([R_1; R_2]). This binary operation is associative, and this is all MPI needs to use a user-defined operation on a user-defined datatype. Moreover, if we change the signs of the elements of R so that the diagonal of R holds positive elements, then the binary operation Rfactor becomes commutative. The code becomes two lines: lapack_dgeqrf( mloc, n, A, lda, tau, &dlwork, lwork, &info ); MPI_Allreduce( MPI_IN_PLACE, A, 1, MPI_UPPER, LILA_MPIOP_QR_UPPER, mpi_comm); Julien Langou University of Colorado Denver Hierarchical QR 31 of 103
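For illustration, here is a minimal sketch of how such a user-defined reduction could be registered with MPI. The datatype MPI_UPPER and operation LILA_MPIOP_QR_UPPER on the slide are the authors' own packed-triangular versions; this sketch instead ships the full n-by-n block, reuses the hypothetical tsqr_combine helper from the earlier sketch, and uses a file-scope n only to keep the example short.

    #include <mpi.h>

    static int g_n;  /* number of columns, set before the reduction (illustration only) */

    /* User-defined MPI reduction: inout := R factor of QR([in; inout]). */
    static void qr_reduce_op(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        (void)dt;
        for (int i = 0; i < *len; i++) {
            const double *Rin = (const double *)in + (size_t)i * g_n * g_n;
            double *Rio       = (double *)inout    + (size_t)i * g_n * g_n;
            tsqr_combine(g_n, Rin, Rio, Rio);   /* reads Rio fully before writing it */
        }
    }

    /* Reduce the per-process R factors (n-by-n, column-major) across comm. */
    void allreduce_tsqr(double *Rlocal, int n, MPI_Comm comm)
    {
        MPI_Datatype r_type;
        MPI_Op       qr_op;

        g_n = n;
        MPI_Type_contiguous(n * n, MPI_DOUBLE, &r_type);
        MPI_Type_commit(&r_type);
        /* commute = 0: the combine is associative; it only becomes commutative
           with the positive-diagonal sign convention mentioned on the slide. */
        MPI_Op_create(qr_reduce_op, 0, &qr_op);

        MPI_Allreduce(MPI_IN_PLACE, Rlocal, 1, r_type, qr_op, comm);

        MPI_Op_free(&qr_op);
        MPI_Type_free(&r_type);
    }

With commute set to 0, MPI is free to pick any reduction tree while preserving operand order, which is exactly the "any tree does" observation of the slide.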
32 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Binary tree, flat tree, a weird tree, another weird tree. [Figure: parallelism versus time for each of the four reduction trees.] Julien Langou University of Colorado Denver Hierarchical QR 32 of 103
33 Tall and Skinny matrices Minimizing communication in hierarchichal parallel distributed (6) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 33 of 103
34 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) [Table: inter-site latency (ms) and throughput (Mb/s) between the sites Orsay, Toulouse, Bordeaux and Sophia.] Julien Langou University of Colorado Denver Hierarchical QR 34 of 103
35 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Cluster 1 Illustration of ScaLAPACK PDGEQR2 without reduce affinity. Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 Julien Langou University of Colorado Denver Hierarchical QR 35 of 103
36 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Cluster 1 Illustration of ScaLAPACK PDGEQR2 with reduce affinity. Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 Julien Langou University of Colorado Denver Hierarchical QR 36 of 103
37 Tall and Skinny matrices Minimizing communication in hierarchichal parallel distributed (6) Cluster 1 Illustration of TSQR without reduce affinity Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 Julien Langou University of Colorado Denver Hierarchical QR 37 of 103
38 Tall and Skinny matrices Minimizing communication in hierarchichal parallel distributed (6) Cluster 1 Illustration of TSQR with reduce affinity Domain 1,1 Domain 1,2 Domain 1,3 Domain 1,4 Domain 1,5 Cluster 2 Domain 2,1 Domain 2,2 Domain 2,3 Domain 2,4 Cluster 3 Domain 3,1 Domain 3,2 Julien Langou University of Colorado Denver Hierarchical QR 38 of 103
39 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Using 4 clusters of 32 processors. [Figures: performance (GFlop/sec), experiments and model, of Classical Gram-Schmidt and of TSQR with a binary tree, for pg = 1, 2 or 4 clusters, pc = 32 nodes per cluster, n = 32, m varying along the x axis; 1 cluster reaches 9.27 GFlops/sec.] Two effects at once: (1) avoiding communication with TSQR, (2) tuning of the reduction tree. Julien Langou University of Colorado Denver Hierarchical QR 39 of 103
40 Tall and Skinny matrices Minimizing communication in sequential (1) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 40 of 103
41 Tall and Skinny matrices Minimizing communication in sequential (1) Consider architecture: sequential case, one processing unit with cache of size W. Problem: QR factorization of an m-by-n TS matrix (TS means m >= n, and W >= (3/2) n^2). Answer: flat tree. Theory (flat tree vs LAPACK-like vs lower bound): # flops: 2mn^2 vs 2mn^2 vs Theta(mn^2); # words: 2mn vs m^2 n^2 / (2W) vs 2mn; # messages: 3mn/W vs mn^2 / (2W) vs 2mn/W. Julien Langou University of Colorado Denver Hierarchical QR 41 of 103
42 Rectangular matrices Minimizing communication in sequential (2) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 42 of 103
43 Rectangular matrices Minimizing communication in sequential (2) 100 M fast = 10 6 M slow = 10 9 β = 10 8 γ = sequential case LU (or QR) factorization of square matrices 90 percent of CPU peak performance /2 n = M fast problem lower bound upper bound for CAQR flat tree lower bound for the LAPACK algorithm 1/2 n = M slow n (matrix size) Julien Langou University of Colorado Denver Hierarchical QR 43 of 103
44 Rectangular matrices Minimizing communication in sequential (2) sequential case LU (or QR) factorization of square matrices ( # of slow memory references ) / (problem lower bound) M fast = 10 3 lower bound for the LAPACK algorithm upper bound for CAQR flat tree M fast = 10 6 M fast = n size of the matrix Julien Langou University of Colorado Denver Hierarchical QR 44 of 103
45 Rectangular matrices Maximizing parallelism on multicore nodes (30) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 45 of 103
46 Rectangular matrices Maximizing parallelism on multicore nodes (30) Mission Perform the QR factorization of an initial tiled matrix A on a multicore platform. time Julien Langou University of Colorado Denver Hierarchical QR 46 of 103
47 Rectangular matrices Maximizing parallelism on multicore nodes (30) for (k = 0; k < TILES; k++) { dgeqrt(A[k][k], T[k][k]); for (n = k+1; n < TILES; n++) dlarfb(A[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++) { dtsqrt(A[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 47 of 103
48 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); V R T A dgeqrt 0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 48 of 103
49 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); V R T A dgeqrt 0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 49 of 103
50 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); R T V A dgeqrt 0 dlarb 0,1 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 50 of 103
51 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); R T V A dgeqrt 0 dlarb 0,1 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 51 of 103
52 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); R T V A dgeqrt 0 dlarb 0,1 dlarb 0,2 dlarb 0,3 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 52 of 103
53 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); R T V A dgeqrt 0 dlarb 0,1 dlarb 0,2 dlarb 0,3 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 53 of 103
54 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); R T R dgeqrt 0 V A dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 54 of 103
55 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(a[0][0], T[0][0]); 2. dlar+(a[0][0], T[0][0], A[0][1]); 3. dlar+(a[0][0], T[0][0], A[0][2]); 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); R T R dgeqrt 0 V A dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 55 of 103
56 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A dgeqrt 0 B V B dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } dssrb 1,1 Julien Langou University of Colorado Denver Hierarchical QR 56 of 103
57 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlar+(a[0][0], T[0][0], A[0][3]); 5. dtsqrt(a[0][0], A[1][0], T[1][0]); 6. dssr+(a[1][0], T[1][0], A[0][1], A[1][1]); A T A dgeqrt 0 B V B dlarb 0,1 dlarb 0,2 dlarb 0,3 dtsqrt 1,0 for (k = 0; k < TILES; k++) { dgeqrt(a[k][k], T[k][k]); for (n = k+1; n < TILES; n++) { dlar+(a[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++){ dtsqrt(a[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssr+(a[m][k], T[m][k], A[k][n], A[m][n]); } } dssrb 1,1 Julien Langou University of Colorado Denver Hierarchical QR 57 of 103
58 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlarfb(A[0][0], T[0][0], A[0][3]); 5. dtsqrt(A[0][0], A[1][0], T[1][0]); 6. dssrfb(A[1][0], T[1][0], A[0][1], A[1][1]); 7. dssrfb(A[1][0], T[1][0], A[0][2], A[1][2]); 8. dssrfb(A[1][0], T[1][0], A[0][3], A[1][3]); [Figure: dgeqrt 0, dlarfb 0,1, dlarfb 0,2, dlarfb 0,3, dtsqrt 1,0, dssrfb 1,1, dssrfb 1,2, dssrfb 1,3.] for (k = 0; k < TILES; k++) { dgeqrt(A[k][k], T[k][k]); for (n = k+1; n < TILES; n++) dlarfb(A[k][k], T[k][k], A[k][n]); for (m = k+1; m < TILES; m++) { dtsqrt(A[k][k], A[m][k], T[m][k]); for (n = k+1; n < TILES; n++) dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]); } } Julien Langou University of Colorado Denver Hierarchical QR 58 of 103
59 Rectangular matrices Maximizing parallelism on multicore nodes (30) DGEQRT DLARFB DLARFB DTSQRT DLARFB DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DGEQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DLARFB DLARFB DTSQRT DLARFB DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DTSQRT DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DGEQRT DSSRFB DSSRFB DSSRFB DTSQRT DLARFB DTSQRT DLARFB DSSRFB DSSRFB DSSRFB DSSRFB DSSRFB DTSQRT DGEQRT DSSRFB DSSRFB DLARFB DTSQRT DSSRFB DGEQRT DAG for 5x5 matrix (QR factorization) Julien Langou University of Colorado Denver Hierarchical QR 59 of 103
60 Rectangular matrices Maximizing parallelism on multicore nodes (30) [Fig. 9: DAG for a 20x20 tile matrix (QR factorization). Fig. 10: DAG for an LU factorization with 20 tiles (block size 200 and matrix size 4000). The size of the DAG grows very fast with the number of tiles.] ... more tasks until some are completed. The usage of a window of tasks has implications in how the loops of an application are unfolded and how much look-ahead is possible. Julien Langou University of Colorado Denver Hierarchical QR 60 of 103
61 Rectangular matrices Maximizing parallelism on multicore nodes (30) Tile QR factorization on a 3.2 GHz CELL processor. [Figure 5: Performance of the tile QR factorization in single precision on a 3.2 GHz CELL processor with eight SPEs. Square matrices were used. The solid horizontal line marks the performance of the SSSRFB kernel times the number of SPEs (22.16 x 8 = 177 Gflop/s).] The presented implementation of tile QR factorization on the CELL processor allows for factorization of a 4000 by 4000 dense matrix in single precision in exactly half of a second. To the authors' knowledge, at present, it is the fastest reported time of solving such a problem by any semiconductor device implemented on a single semiconductor die. Jakub Kurzak and Jack Dongarra, LAWN 201, QR Factorization for the CELL Processor, May 2008. Julien Langou University of Colorado Denver Hierarchical QR 61 of 103
62 Rectangular matrices Maximizing parallelism on multicore nodes (30) Perform the QR factorization of an initial tiled matrix A. time. Our tool: Givens rotations: [ cos θ, sin θ ; -sin θ, cos θ ]. Julien Langou University of Colorado Denver Hierarchical QR 62 of 103
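At the scalar level the elementary tool looks as follows; in the tiled algorithms each such rotation is replaced by a tile kernel (TSQRT or TTQRT), but the elimination pattern is the same. A minimal, textbook sketch (the function name is ours):

    #include <math.h>

    /* Compute c, s such that [ c  s; -s  c ] * [a; b] = [r; 0]:
       the Givens rotation that zeroes the entry b against a. */
    void givens(double a, double b, double *c, double *s, double *r)
    {
        if (b == 0.0) {              /* nothing to eliminate */
            *c = 1.0; *s = 0.0; *r = a;
        } else {
            double h = hypot(a, b);  /* sqrt(a^2 + b^2), overflow-safe */
            *c = a / h;
            *s = b / h;
            *r = h;
        }
    }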
63 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1, binomial tree: 4 steps. First column: the goal is to introduce zeros using Givens rotations; only the top entry shall stay. At each step, one entry eliminates another (this is a reduction). time. The first step requires seven computing units, the second four, the third two, and the fourth one. Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
64 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 binomial tree 4 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
65 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
66 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 8 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
67 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 8 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
68 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
69 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
70 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
71 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
72 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
73 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 time Note: one point of the plasma tree is to (1) enhance parallelism in the rectangular case, (2) increase data locality (when working within a domain). Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
74 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
75 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 8 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
76 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
77 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree 4 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
78 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree 4 6 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
79 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
80 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
81 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Flat tree: Sameh and Kuck (1978). Plasma tree: Hadri et al. (2010). Note that the binomial tree corresponds to bs = 1, and the flat tree to bs = p. Greedy: Modi and Clarke (1984), Cosnard and Robert (1986). Cosnard and Robert (1986) proved that no matter the shape of the matrix (i.e., p and q), Greedy was optimal. Julien Langou University of Colorado Denver Hierarchical QR 63 of 103
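For a single column (q = 1) the coarse step counts quoted in the figures above can be reproduced directly; a small sketch counting coarse-grain elimination steps only (kernel weights ignored), with function names of our choosing:

    #include <math.h>

    /* Steps to reduce one column of p tiles to a single tile. */
    int flat_tree_steps(int p)     { return p - 1; }                      /* p = 15 -> 14 */
    int binomial_tree_steps(int p) { return (int)ceil(log2((double)p)); } /* p = 15 -> 4  */

    /* PLASMA-style tree: flat tree inside each domain of bs tiles,
       then a binomial tree across the domain heads. */
    int plasma_tree_steps(int p, int bs)
    {
        int domains = (p + bs - 1) / bs;
        return (bs - 1) + (int)ceil(log2((double)domains)); /* p = 15, bs = 4 -> 5 */
    }

For q = 1 the greedy tree gives the same count as the binomial tree (4 for p = 15 in the figures); the differences between the trees only show up for q > 1, where the per-column reductions have to be pipelined.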
82 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 64 of 103
83 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 65 of 103
84 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 66 of 103
85 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 67 of 103
86 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations.
Operation / Panel (name, cost) / Update (name, cost):
Factor square into triangle: GEQRT, 4 / UNMQR, 6
Zero square with triangle on top: TSQRT, 6 / TSMQR, 12
Zero triangle with triangle on top: TTQRT, 2 / TTMQR, 6
Algorithm 1: Elimination elim(i, piv(i, k), k) via TS kernels (triangle on top of square):
GEQRT(piv(i, k), k); for j = k + 1 to q do UNMQR(piv(i, k), k, j); TSQRT(i, piv(i, k), k); for j = k + 1 to q do TSMQR(i, piv(i, k), k, j).
Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels (triangle on top of triangle):
GEQRT(piv(i, k), k); GEQRT(i, k); for j = k + 1 to q do UNMQR(piv(i, k), k, j), UNMQR(i, k, j); TTQRT(i, piv(i, k), k); for j = k + 1 to q do TTMQR(i, piv(i, k), k, j).
Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
87 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
88 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
89 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
90 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
91 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square 0 Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
92 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
93 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
94 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
95 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
96 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
97 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
98 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 0 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
99 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Note: it is understood that if a tile is already in triangle form, then the associated GEQRT and update kernels don t need to be applied. Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
100 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations Operation Panel Update Name Cost Name Cost Factor square into triangle GEQRT 4 UNMQR 6 Zero square with triangle on top TSQRT 6 TSMQR 12 Zero triangle with triangle on top TTQRT 2 TTMQR 6 Algorithm 2: Elimination elim(i, piv(i, k), k) via TS kernels. GEQRT (piv(i, k), k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) TSQRT (i, piv(i, k), k) for j = k + 1 to q do TSMQR(i, piv(i, k), k, j) Triangle on top of square Algorithm 2: Elimination elim(i, piv(i, k), k) via TT kernels. GEQRT (piv(i, k), k) GEQRT (i, k) for j = k + 1 to q do UNMQR(piv(i, k), k, j) UNMQR(i, k, j) TTQRT (i, piv(i, k), k) for j = k + 1 to q do TTMQR(i, piv(i, k), k, j) Triangle on top of triangle Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
101 Rectangular matrices Maximizing parallelism on multicore nodes (30) The analysis and coding is a little more complex... time Julien Langou University of Colorado Denver Hierarchical QR 69 of 103
102 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: No matter what elimination list (any combination of TT, TS) is used, the total weight of the tasks for performing a tiled QR factorization algorithm is constant and equal to 6pq^2 - 2q^3. Proof:
L1 :: GEQRT = TTQRT + q
L2 :: UNMQR = TTMQR + (1/2) q (q - 1)
L3 :: TTQRT + TSQRT = pq - (1/2) q (q + 1)
L4 :: TTMQR + TSMQR = (1/2) pq (q - 1) - (1/6) q (q - 1) (q + 1)
Define L5 as 4 L1 + 6 L2 + 6 L3 + 12 L4 and we get
L5 :: 6pq^2 - 2q^3 = 4 GEQRT + 12 TSMQR + 6 TTMQR + 2 TTQRT + 6 TSQRT + 6 UNMQR = total # of flops.
Note: Using our unit task weight of b^3/3, with m = pb and n = qb, we obtain 2mn^2 - (2/3) n^3 flops, which is the exact same number as for a standard Householder reflection algorithm as found in LAPACK or ScaLAPACK. Julien Langou University of Colorado Denver Hierarchical QR 70 of 103
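A quick check of the note above, substituting m = pb, n = qb, and the unit task weight b^3/3:

\[
\left(6pq^{2} - 2q^{3}\right)\frac{b^{3}}{3}
  \;=\; 2\,(pb)(qb)^{2} \;-\; \tfrac{2}{3}(qb)^{3}
  \;=\; 2mn^{2} - \tfrac{2}{3}n^{3}.
\]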
103 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: For a tiled matrix of size p x q, where p >= q, the critical path length of FLATTREE is: 2p + 2 if p >= q = 1; 6p + 16q - 22 if p > q > 1; 22p - 24 if p = q > 1. [Figure: pipelined progression of the 2nd, 3rd, and 4th rows and of the 2nd and 3rd columns over time.] Initial: 10; fill the pipeline: 6(p - 1); pipeline: 16(q - 2); end: 4. Julien Langou University of Colorado Denver Hierarchical QR 71 of 103
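Reading the three cases of the theorem above directly into a helper (a sketch; formulas as reconstructed from the slide, for p >= q):

    /* Critical path length of FLATTREE for a p-by-q tile matrix, p >= q. */
    int flattree_critical_path(int p, int q)
    {
        if (q == 1)      return 2 * p + 2;
        else if (p == q) return 22 * p - 24;          /* p = q > 1 */
        else             return 6 * p + 16 * q - 22;  /* p > q > 1 */
    }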
104 Rectangular matrices Maximizing parallelism on multicore nodes (30) Greedy tree on a 15x4 matrix. (Weighted on top, coarse below.) time time Julien Langou University of Colorado Denver Hierarchical QR 72 of 103
105 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: For a tiled matrix of size p x q, where p >= q, given the time of any coarse-grain algorithm, COARSE(p, q), the corresponding weighted algorithm time WEIGHTED(p, q) satisfies 10 (q - 1) + 6 COARSE(p, q - 1) <= WEIGHTED(p, q) <= 10 (q - 1) + 6 COARSE(p, q - 1) + 4 + 2 (COARSE(p, q) - COARSE(p, q - 1)). Julien Langou University of Colorado Denver Hierarchical QR 73 of 103
106 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem We can prove that the GRASAP algorithm is optimal. Julien Langou University of Colorado Denver Hierarchical QR 74 of 103
107 Rectangular matrices Maximizing parallelism on multicore nodes (30) Platform: All experiments were performed on a 48-core machine composed of eight hexa-core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz. Each core has a theoretical peak of 11.2 Gflop/s, with a peak of 537.6 Gflop/s for the whole machine. Experimental code is written using the QUARK scheduler. Results are tested with ||I - Q^T Q|| and ||A - QR|| / ||A||. Code has been written in real and complex arithmetic, single and double precision. Julien Langou University of Colorado Denver Hierarchical QR 75 of 103
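The first accuracy test can be computed with two standard BLAS/LAPACK calls. A minimal sketch (not the authors' test harness), assuming Q is m-by-n, column-major, with leading dimension m:

    #include <stdlib.h>
    #include <cblas.h>
    #include <lapacke.h>

    /* Frobenius norm of I - Q^T Q for an m-by-n Q (column-major, lda = m). */
    double orthogonality_error(int m, int n, const double *Q)
    {
        double *C = calloc((size_t)n * n, sizeof(double));
        for (int i = 0; i < n; i++) C[(size_t)i * n + i] = 1.0;   /* C = I */

        /* C := -1 * Q^T * Q + 1 * C, upper triangle only (C is symmetric). */
        cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans,
                    n, m, -1.0, Q, m, 1.0, C, n);

        double err = LAPACKE_dlansy(LAPACK_COL_MAJOR, 'F', 'U', n, C, n);
        free(C);
        return err;
    }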
108 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200, Model) 5 Best domain size for PLASMATREE (TT) = [ ] 4.5 FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY Overhead in cp length with respect to GREEDY (GREEDY = 1) q Julien Langou University of Colorado Denver Hierarchical QR 76 of 103
109 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200, Experimental) 5 Best domain size for PLASMATREE (TT) = [ ] 4.5 FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY Overhead in time with respect to GREEDY (GREEDY = 1) q Julien Langou University of Colorado Denver Hierarchical QR 77 of 103
110 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200, Model on left, Experimental on right) Best domain size for PLASMATREE (TT) = [ ] Best domain size for PLASMATREE (TT) = [ ] FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY 4.5 FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY Overhead in cp length with respect to GREEDY (GREEDY = 1) Overhead in time with respect to GREEDY (GREEDY = 1) q Main difference is on the right where there is enough parallelism so all method give peak performance. (peak is TTMQRT performance) q Julien Langou University of Colorado Denver Hierarchical QR 78 of 103
111 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200) 160 Best domain size for PLASMATREE (TT) = [ ] GFLOP/s FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY q Julien Langou University of Colorado Denver Hierarchical QR 79 of 103
112 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex Arithmetic (p=40,b=200, Model on left, Experimental on right) Best domain size for PLASMATREE (TT) = [ ] Best domain size for PLASMATREE (TT) = [ ] Predicted GFLOP/s GFLOP/s FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY 20 FLATTREE (TT) PLASMATREE (TT) (best) FIBONACCI (TT) GREEDY q q Julien Langou University of Colorado Denver Hierarchical QR 80 of 103
113 Rectangular matrices Parallelism+communication on multicore nodes (2) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 81 of 103
114 Rectangular matrices Parallelism+communication on multicore nodes (2) Factor kernels (code name, # flops): GEQRT 4, TTQRT 2, TSQRT 6. We work on square b-by-b tiles. The unit for the task weights is b^3/3. Each of our tasks is O(b^3) computation for O(b^2) communication. Thanks to this surface/volume effect, we can, in a first approximation, neglect communication and focus only on the parallelism. Julien Langou University of Colorado Denver Hierarchical QR 82 of 103
115 Rectangular matrices Parallelism+communication on multicore nodes (2) Using TT kernels (GEQRT, GEQRT, TTQRT): total # of flops (sequential time): 2 * GEQRT + TTQRT = 2 * 4 + 2 = 10; critical path length (parallel time): GEQRT + TTQRT = 4 + 2 = 6. Using TS kernels (GEQRT, TSQRT): total # of flops (sequential time): GEQRT + TSQRT = 4 + 6 = 10; critical path length (parallel time): GEQRT + TSQRT = 4 + 6 = 10. One remarkable outcome is that the total number of flops is 10 b^3/3. If we consider the number of flops for a standard LAPACK algorithm (based on standard Householder transformations), we would also obtain 10 b^3/3. Julien Langou University of Colorado Denver Hierarchical QR 83 of 103
116 Rectangular matrices Parallelism+communication on multicore nodes (2) [Figure: kernel performance, GFLOP/s versus tile size, for TSQRT, GEQRT, TTQRT, GEQRT+TTQRT and GEMM, (in) and (out).] The TS kernels (factorization and update) are more efficient than the TT ones due to a variety of reasons (better vectorization, better volume-to-surface ratio). Julien Langou University of Colorado Denver Hierarchical QR 84 of 103
117 Rectangular matrices Parallelism+communication on multicore nodes (2) Real Arithmetic (p=40, b=200, Model on left, Experimental on right). [Figures: predicted and measured GFLOP/s versus q for FLATTREE (TS), PLASMATREE (TS) (best), FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT) and GREEDY.] Motivates the introduction of a TS level: square 1-by-1 tiles are grouped into larger rectangular tiles of size a-by-1, which are eliminated in a TS fashion. The regular TT algorithm is then applied on the rectangular tiles. Pros: recover the performance of the TS level. Cons: introduce a tuning parameter (a). Julien Langou University of Colorado Denver Hierarchical QR 85 of 103
118 Rectangular matrices Parallelism+communication on multicore nodes (2) Real Arithmetic (p=40,b=200, performance on left, overhead on right) 250 P=40 MB=200,IB=32,cores = 48 2 P=40 MB=200,IB=32,cores = 48 Gflop/sec overhead wrt BEST BEST FT (TS) BINOMIAL (TT) FT (TT) FIBO (TT) Greedy (TT) Greedy+RRon+a=8 Greedy+RRoff+a=8 50 BEST FT (TS) BINOMIAL (TT) FT (TT) FIBO (TT) Greedy (TT) Greedy+RRon+a=8 Greedy+RRoff+a= Q Q Idea: use TS domain of size a, use greedy on top of them, use round-robin on the domains (This experiment was done with DAGuE.) Julien Langou University of Colorado Denver Hierarchical QR 86 of 103
119 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchichal parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 87 of 103
120 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Outline Three goals: 1. minimize the number of local communications 2. minimize the number of parallel distributed communications 3. maximize the parallelism within a node Julien Langou University of Colorado Denver Hierarchical QR 88 of 103
121 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103
122 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103
123 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103
124 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical algorithm layout Legend P 0 P 1 P 2 0 local TS 1 local TT 2 domino 3 global tree Julien Langou University of Colorado Denver Hierarchical QR 90 of 103
More informationTiled QR factorization algorithms
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE Tiled QR factorization algorithms Henricus Bouwmeester Mathias Jacuelin Julien Langou Yves Robert N 7601 Avril 2011 Distributed and High
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More informationTiled QR factorization algorithms
Tiled QR factorization algorithms Henricus Bouwmeester, Mathias Jacuelin, Julien Langou, Yves Robert To cite this version: Henricus Bouwmeester, Mathias Jacuelin, Julien Langou, Yves Robert. Tiled QR factorization
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationMinimizing Communication in Linear Algebra. James Demmel 15 June
Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on
More informationMatrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing)
Matrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing) alfredo.buttari@enseeiht.fr for an up-to-date version of the slides: http://buttari.perso.enseeiht.fr Introduction Objective
More informationLAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou Dpt of Electrical Engineering
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationA communication-avoiding thick-restart Lanczos method on a distributed-memory system
A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June
More informationAvoiding Communication in Distributed-Memory Tridiagonalization
Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura
More informationMARCH 24-27, 2014 SAN JOSE, CA
MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =
More informationRe-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009
III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009 Background and motivation Running time of an algorithm
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois
More informationOverview of Dense Numerical Linear Algebra Libraries
Overview of Dense Numerical Linear Algebra Libraries BLAS: kernel for dense linear algebra LAPACK: sequen;al dense linear algebra ScaLAPACK: parallel distributed dense linear algebra And Beyond Linear
More informationA distributed packed storage for large dense parallel in-core calculations
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A
More informationIntroduction to communication avoiding linear algebra algorithms in high performance computing
Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 5 Eigenvalue Problems Section 5.1 Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationPerformance optimization of WEST and Qbox on Intel Knights Landing
Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationEnhancing Performance of Tall-Skinny QR Factorization using FPGAs
Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Abid Rafique Imperial College London August 31, 212 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18 Our Claim Common
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More information2.5D algorithms for distributed-memory computing
ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou To cite this version: Emmanuel Agullo, Camille Coti,
More informationTable 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )
ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS Abid Rafique, Nachiket Kapre and George A. Constantinides Electrical and Electronic Engineering Department Imperial College London London,
More informationJacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More informationPerformance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank
More informationParallelizing Gaussian Process Calcula1ons in R
Parallelizing Gaussian Process Calcula1ons in R Christopher Paciorek UC Berkeley Sta1s1cs Joint work with: Benjamin Lipshitz Wei Zhuo Prabhat Cari Kaufman Rollin Thomas UC Berkeley EECS (formerly) IBM
More informationPorting a Sphere Optimization Program from lapack to scalapack
Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential
More informationPower Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
Noname manuscript No. (will be inserted by the editor) Power Profiling of Cholesky and QR Factorizations on Distributed s George Bosilca Hatem Ltaief Jack Dongarra Received: date / Accepted: date Abstract
More informationMulticore Parallelization of Determinant Quantum Monte Carlo Simulations
Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March
More informationA short note on the Householder QR factorization
A short note on the Householder QR factorization Alfredo Buttari October 25, 2016 1 Uses of the QR factorization This document focuses on the QR factorization of a dense matrix. This method decomposes
More informationImprovements for Implicit Linear Equation Solvers
Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationab initio Electronic Structure Calculations
ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More informationA parallel tiled solver for dense symmetric indefinite systems on multicore architectures
A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris
More informationStatic-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems
Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse
More informationThe Future of LAPACK and ScaLAPACK
The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationPRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM
Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value
More informationCommunication-avoiding Krylov subspace methods
Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current
More informationPorting a sphere optimization program from LAPACK to ScaLAPACK
Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference
More informationParallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179
Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179 Marc Baboulin Luc Giraud Serge Gratton Julien Langou September 19, 2006 Abstract
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest
More informationScalable numerical algorithms for electronic structure calculations
Scalable numerical algorithms for electronic structure calculations Edgar Solomonik C Berkeley July, 2012 Edgar Solomonik Cyclops Tensor Framework 1/ 73 Outline Introduction Motivation: Coupled Cluster
More informationPerformance Evaluation of Some Inverse Iteration Algorithms on PowerXCell T M 8i Processor
Performance Evaluation of Some Inverse Iteration Algorithms on PowerXCell T M 8i Processor Masami Takata 1, Hiroyuki Ishigami 2, Kini Kimura 2, and Yoshimasa Nakamura 2 1 Academic Group of Information
More information1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria
1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation
More informationA Parallel Implementation of the Davidson Method for Generalized Eigenproblems
A Parallel Implementation of the Davidson Method for Generalized Eigenproblems Eloy ROMERO 1 and Jose E. ROMAN Instituto ITACA, Universidad Politécnica de Valencia, Spain Abstract. We present a parallel
More informationReducing the Amount of Pivoting in Symmetric Indefinite Systems
Reducing the Amount of Pivoting in Symmetric Indefinite Systems Dulceneia Becker 1, Marc Baboulin 4, and Jack Dongarra 1,2,3 1 University of Tennessee, USA [dbecker7,dongarra]@eecs.utk.edu 2 Oak Ridge
More informationScalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti
Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti
More informationTheoretical Computer Science
Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized
More informationLU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version
1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2
More informationCS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform
CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast
More informationFinite-choice algorithm optimization in Conjugate Gradients
Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the
More informationDesign of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization
Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan 2, Robert A. van de Geijn 2, and Field
More informationParallel Numerical Linear Algebra
Parallel Numerical Linear Algebra 1 QR Factorization solving overconstrained linear systems tiled QR factorization using the PLASMA software library 2 Conjugate Gradient and Krylov Subspace Methods linear
More informationAvoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University
Avoiding communication in linear algebra Laura Grigori INRIA Saclay Paris Sud University Motivation Algorithms spend their time in doing useful computations (flops) or in moving data between different
More informationAccelerating computation of eigenvectors in the nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationMagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs
MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,
More informationSome notes on efficient computing and setting up high performance computing environments
Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient
More informationAccelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationIterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay
Iterative methods, preconditioning, and their application to CMB data analysis Laura Grigori INRIA Saclay Plan Motivation Communication avoiding for numerical linear algebra Novel algorithms that minimize
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationTOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM
TOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM AZZAM HAIDAR, HATEM LTAIEF, AND JACK DONGARRA Abstract. Classical solvers for the dense symmetric eigenvalue
More informationParalleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures
Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures Mark Gove? NOAA Earth System Research Laboratory We Need Be?er Numerical Weather Predic(on Superstorm Sandy Hurricane
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationPreconditioned Eigensolver LOBPCG in hypre and PETSc
Preconditioned Eigensolver LOBPCG in hypre and PETSc Ilya Lashuk, Merico Argentati, Evgueni Ovtchinnikov, and Andrew Knyazev Department of Mathematics, University of Colorado at Denver, P.O. Box 173364,
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationA hybrid Hermitian general eigenvalue solver
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,
More informationPracticality of Large Scale Fast Matrix Multiplication
Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by
More informationCyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions
Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik 1, Devin Matthews 3, Jeff Hammond 4, James Demmel 1,2 1 Department of
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J. SCI. COMPUT. Vol. 37, No. 3, pp. C307 C330 c 2015 Society for Industrial and Applied Mathematics MIXED-PRECISION CHOLESKY QR FACTORIZATION AND ITS CASE STUDIES ON MULTICORE CPU WITH MULTIPLE GPUS
More informationEnhancing Scalability of Sparse Direct Methods
Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.
More informationScalable Non-blocking Preconditioned Conjugate Gradient Methods
Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William
More informationLAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006.
LAPACK-Style Codes for Pivoted Cholesky and QR Updating Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig 2007 MIMS EPrint: 2006.385 Manchester Institute for Mathematical Sciences School of Mathematics
More informationOn aggressive early deflation in parallel variants of the QR algorithm
On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden
More informationMUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux
The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent
More informationINITIAL INTEGRATION AND EVALUATION
INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More information