Hierarchical QR factorization algorithms for multi-core cluster systems


1 Hierarchical QR factorization algorithms for multi-core cluster systems Jack Dongarra Mathieu Faverge Thomas Herault Julien Langou Yves Robert University of Tennessee Knoxville, USA University of Colorado Denver, USA Ecole Normale Supérieure de Lyon, France JUNE 29, 2012

2 Reducing communication (in sequential, in parallel distributed); increasing parallelism (or reducing the critical path, reducing synchronization). Tall and Skinny matrices: Minimizing communication in parallel distributed (29), Minimizing communication in hierarchical parallel distributed (6), Minimizing communication in sequential (1). Rectangular matrices: Minimizing communication in sequential (2), Maximizing parallelism on multicore nodes (30), Parallelism+communication on multicore nodes (2), Parallelism+communication on distributed+multicore nodes (13), Scheduling on multicore nodes (4). Julien Langou University of Colorado Denver Hierarchical QR 2 of 103

3 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchical parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 3 of 103

4 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Reduce Algorithms: Introduction. The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications. Input: A is block distributed by rows. Output: Q is block distributed by rows, R is global. [Figure: blocks A_1, A_2, A_3 map to Q_1, Q_2, Q_3 and a shared R.] Julien Langou University of Colorado Denver Hierarchical QR 4 of 103

5 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Example of applications: in block iterative methods. a) in iterative methods with multiple right hand sides (block iterative methods): 1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk), 2) BlockGMRES, BlockGCR, BlockCG, BlockQMR. b) in iterative methods with a single right hand side: 1) s-step methods for linear systems of equations (e.g. A. Chronopoulos), 2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc, 3) recent work from M. Hoemmen and J. Demmel (U. California at Berkeley). c) in iterative eigenvalue solvers: 1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC), 2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX, 3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk), 4) PRIMME (A. Stathopoulos, Coll. William & Mary), 5) and also TRLAN, BLZPACK, IRBLEIGS. Julien Langou University of Colorado Denver Hierarchical QR 5 of 103

6 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Reduce Algorithms: Introduction. Example of applications: a) in linear least squares problems in which the number of equations is much larger than the number of unknowns; b) in block iterative methods (iterative methods with multiple right hand sides or iterative eigenvalue solvers); c) in dense, large and more square QR factorizations, where they are used as the panel factorization step. Julien Langou University of Colorado Denver Hierarchical QR 6 of 103

7 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Blocked LU and QR algorithms (LAPACK). LAPACK block LU (right looking): dgetrf. LAPACK block QR (right looking): dgeqrf. Panel factorization: dgetf2 (lu of the panel) for LU; dgeqr2 + dlarft (qr of the panel) for QR. Update of the remaining submatrix: dtrsm (+ dlaswp) and dgemm for LU; dlarfb for QR. [Figure: L, U and V, R factors of A(1) and of A(2).] Julien Langou University of Colorado Denver Hierarchical QR 7 of 103
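A minimal sketch of this right-looking panel/update structure, assuming column-major storage and using LAPACKE's dgeqrf for the panel and dormqr for the trailing update (LAPACK's dgeqrf itself uses dgeqr2/dlarft/dlarfb internally; the effect is the same):

    #include <lapacke.h>

    /* Right-looking blocked QR: factor panels of width nb, then apply the
       panel's Householder reflectors to the trailing submatrix.          */
    void blocked_qr(int m, int n, int nb, double *A, int lda, double *tau)
    {
        for (int j = 0; j < n; j += nb) {
            int jb = (n - j < nb) ? n - j : nb;

            /* Panel factorization: QR of the (m-j)-by-jb panel A(j:m-1, j:j+jb-1). */
            LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m - j, jb,
                           &A[j + (size_t)j * lda], lda, &tau[j]);

            /* Update of the remaining submatrix: A(j:m-1, j+jb:n-1) := Q^T * A(...). */
            if (j + jb < n)
                LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'T',
                               m - j, n - j - jb, jb,
                               &A[j + (size_t)j * lda], lda, &tau[j],
                               &A[j + (size_t)(j + jb) * lda], lda);
        }
    }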

8 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Blocked LU and QR algorithms (LAPACK). LAPACK block LU (right looking): dgetrf. Panel factorization (dgetf2, lu of the panel): latency bounded, more than nb AllReduce operations for n*nb^2 ops. Update of the remaining submatrix (dtrsm (+ dlaswp) and dgemm): CPU/bandwidth bounded, the bulk of the computation (n*n*nb ops), highly parallelizable, efficient and scalable. Julien Langou University of Colorado Denver Hierarchical QR 8 of 103

9 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Parallelization of LU and QR. Parallelize the update: easy and done in any reasonable software. This is the 2/3 n^3 term in the FLOPs count. Can be done efficiently with LAPACK + multithreaded BLAS. [Figure: dgemm update of the trailing submatrix; dgetf2 panel; dtrsm (+ dlaswp) and dgemm.] Julien Langou University of Colorado Denver Hierarchical QR 9 of 103

10 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Parallelization of LU and QR. Parallelize the update: easy and done in any reasonable software. This is the 2/3 n^3 term in the FLOPs count. Can be done efficiently with LAPACK + multithreaded BLAS. Parallelize the panel factorization: not an option in a multicore context (p < 16); see e.g. ScaLAPACK or HPL; it remains by far the slowest part and the bottleneck of the computation. Hide the panel factorization: lookahead (see e.g. High Performance LINPACK), dynamic scheduling. Julien Langou University of Colorado Denver Hierarchical QR 10 of 103

11 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Hiding the panel factorization with dynamic scheduling. [Figure: execution trace over time.] Courtesy of Alfredo Buttari, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 11 of 103

12 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? Julien Langou University of Colorado Denver Hierarchical QR 12 of 103

13 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 13 of 103

14 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. We cannot hide the panel factorization in the MM; actually it is the MMs that are hidden by the panel factorizations! Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 14 of 103

15 Tall and Skinny matrices Minimizing communication in parallel distributed (29) What about strong scalability? N = 1536, NB = 64, procs = 16. We cannot hide the panel factorization (n^2) with the MM (n^3); actually it is the MMs that are hidden by the panel factorizations! NEED FOR NEW MATHEMATICAL ALGORITHMS. Courtesy of Jakub Kurzak, UTennessee. Julien Langou University of Colorado Denver Hierarchical QR 15 of 103

16 Tall and Skinny matrices Minimizing communication in parallel distributed (29) A new generation of algorithms? Algorithms follow hardware evolution along time. LINPACK (80s, vector operations): rely on Level 1 BLAS operations. LAPACK (90s, blocking, cache friendly): rely on Level 3 BLAS operations. Julien Langou University of Colorado Denver Hierarchical QR 16 of 103

17 Tall and Skinny matrices Minimizing communication in parallel distributed (29) A new generation of algorithms? Algorithms follow hardware evolution along time. LINPACK (80s, vector operations): rely on Level 1 BLAS operations. LAPACK (90s, blocking, cache friendly): rely on Level 3 BLAS operations. New algorithms (00s, multicore friendly): rely on a DAG/scheduler, a block data layout, and some extra kernels. These new algorithms have a very low granularity and scale very well (multicore, petascale computing); remove a lot of dependencies among the tasks (multicore, distributed computing); avoid latency (distributed computing, out of core); rely on fast kernels. These new algorithms need new kernels and rely on efficient scheduling algorithms. Julien Langou University of Colorado Denver Hierarchical QR 17 of 103

18 Tall and Skinny matrices Minimizing communication in parallel distributed (29) New algorithms based on 2D partitioning: UTexas (van de Geijn): SYRK, CHOL (multicore), LU, QR (out of core); UTennessee (Dongarra): CHOL (multicore); HPC2N (Kågström)/IBM (Gustavson): CHOL (distributed); UCBerkeley (Demmel)/INRIA (Grigori): LU/QR (distributed); UCDenver (Langou): LU/QR (distributed). A 3rd revolution for dense linear algebra? Julien Langou University of Colorado Denver Hierarchical QR 18 of 103

19 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Reducing communication (in sequential, in parallel distributed); increasing parallelism (or reducing the critical path, reducing synchronization). We start with reducing communication in parallel distributed in the tall and skinny case. Julien Langou University of Colorado Denver Hierarchical QR 19 of 103

20 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. [Figure: processes 0 and 1 hold A_0 and A_1; the factorization produces Q_0 and Q_1; axes: processes vs. time.] Julien Langou University of Colorado Denver Hierarchical QR 20 of 103

21 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 1: each process i computes a local QR factorization, (V_i^(0), R_i^(0)) = QR(A_i). Julien Langou University of Colorado Denver Hierarchical QR 21 of 103

22 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 2: the two local R factors R_0^(0) and R_1^(0) are stacked on process 0. Julien Langou University of Colorado Denver Hierarchical QR 22 of 103

23 Tall and Skinny matrices Minimizing communication in parallel distributed (29) On two processes. Step 3: process 0 computes (V^(1), R^(1)) = QR([R_0^(0); R_1^(0)]); R^(1) is the R factor of the whole matrix. Julien Langou University of Colorado Denver Hierarchical QR 23 of 103
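A minimal two-process sketch of these three steps in MPI + LAPACKE (assumptions: each process owns an mloc-by-n column-major block A with mloc >= n, only R is wanted, and the R factors are exchanged as full n-by-n blocks with explicit zeros below the diagonal):

    #include <mpi.h>
    #include <lapacke.h>
    #include <stdlib.h>
    #include <string.h>

    /* Two-process TSQR, R factor only: local QR on each process, then QR of
       the two stacked n-by-n R factors on process 0.                        */
    void tsqr_two_procs(int mloc, int n, double *A, int lda,
                        double *R /* n*n, filled on rank 0 */, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        /* Step 1: local QR; R_i^(0) ends up in the upper triangle of A. */
        double *tau = malloc(n * sizeof(double));
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, mloc, n, A, lda, tau);

        /* Pack the local R into a dense n-by-n buffer (zeros below the diagonal). */
        double *Rloc = calloc((size_t)n * n, sizeof(double));
        for (int j = 0; j < n; j++)
            for (int i = 0; i <= j; i++)
                Rloc[i + (size_t)j * n] = A[i + (size_t)j * lda];

        if (rank == 1) {
            /* Step 2: send R_1^(0) to process 0. */
            MPI_Send(Rloc, n * n, MPI_DOUBLE, 0, 0, comm);
        } else {
            double *Rrecv = malloc((size_t)n * n * sizeof(double));
            MPI_Recv(Rrecv, n * n, MPI_DOUBLE, 1, 0, comm, MPI_STATUS_IGNORE);

            /* Step 3: stack [R_0^(0); R_1^(0)] into a 2n-by-n matrix and factor it. */
            double *stack = malloc((size_t)2 * n * n * sizeof(double));
            double *tau2  = malloc(n * sizeof(double));
            for (int j = 0; j < n; j++) {
                memcpy(&stack[(size_t)j * 2 * n],     &Rloc[(size_t)j * n],  n * sizeof(double));
                memcpy(&stack[(size_t)j * 2 * n + n], &Rrecv[(size_t)j * n], n * sizeof(double));
            }
            LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * n, n, stack, 2 * n, tau2);

            /* The global R is the upper triangle of the factored stack. */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++)
                    R[i + (size_t)j * n] = (i <= j) ? stack[i + (size_t)j * 2 * n] : 0.0;
            free(stack); free(Rrecv); free(tau2);
        }
        free(Rloc); free(tau);
    }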

24 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: seven processes hold A_0 ... A_6; each computes its Q_0 ... Q_6 and all share the same R, so that A = QR; axes: processes vs. time.] Julien Langou University of Colorado Denver Hierarchical QR 24 of 103

25 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: reduction tree over processes A_0 ... A_6; computation and communication steps laid out over time.] Julien Langou University of Colorado Denver Hierarchical QR 25 of 103

26 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: reduction tree over processes A_0 ... A_6; computation and communication steps laid out over time.] Julien Langou University of Colorado Denver Hierarchical QR 26 of 103

27 Tall and Skinny matrices Minimizing communication in parallel distributed (29) The big picture. [Figure: reduction tree over processes A_0 ... A_6; computation and communication steps laid out over time.] Julien Langou University of Colorado Denver Hierarchical QR 27 of 103

28 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Latency, but also the possibility of fast panel factorization. DGEQR3 is the recursive algorithm (see Elmroth and Gustavson, 2000); DGEQRF and DGEQR2 are the LAPACK routines. Times include QR and DLARFT. Run on a Pentium III. [Table: QR factorization and construction of T, m = 10,000, n varying; performance in MFLOP/s (times in sec) for DGEQR3, DGEQRF, DGEQR2; the recursive DGEQR3 is consistently the fastest, e.g. 2.53 s vs. 7.09 s (DGEQRF) and 11.98 s (DGEQR2) at the largest n.] [Plot: MFLOP/s for m = 1,000,000; the x-axis is n.] Julien Langou University of Colorado Denver Hierarchical QR 28 of 103

29 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Q and R: strong scalability on Blue Gene L (frost.ncar.edu). In this experiment, we fix the problem: m = 1,000,000 and n = 50. Then we increase the number of processors. [Plot: MFLOP/s per processor vs. number of processors for ReduceHH (QR3), ReduceHH (QRF), ScaLAPACK QRF, ScaLAPACK QR2.] Julien Langou University of Colorado Denver Hierarchical QR 29 of 103

30 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Communication :: TSQR :: parallel case. When only R is wanted. [Figure: binary reduction tree; each process computes QR(A_i) = (V_i^(0), R_i^(0)); pairs of R factors are stacked and refactored up the tree until the final R.] Consider architecture: parallel case, P processing units. Problem: QR factorization of an m-by-n TS matrix (TS: m/P >= n). (Main) assumption: the operation is truly parallel distributed. Answer: binary tree. Theory (TSQR vs. ScaLAPACK-like vs. lower bound): # flops: 2mn^2/P + (2n^3/3) log P vs. 2mn^2/P - 2n^3/(3P) vs. Theta(mn^2/P); # words: (n^2/2) log P vs. (n^2/2) log P vs. (n^2/2) log P; # messages: log P vs. 2n log P vs. log P. Julien Langou University of Colorado Denver Hierarchical QR 30 of 103
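A quick sanity check on the latency column, using the counts in the table above and, for example, P = 64 processors with n = 50 (the shape of the Blue Gene experiment two slides back): TSQR sends log2 P = 6 messages along the critical path, while the ScaLAPACK-like algorithm sends 2n log2 P = 2 * 50 * 6 = 600, a 100x reduction in latency-bound synchronizations; the flop counts differ only in the lower-order n^3 terms.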

31 Tall and Skinny matrices Minimizing communication in parallel distributed (29) Any tree does. When only R is wanted: the MPI_Allreduce. In the case where only R is wanted, instead of constructing our own tree, one can simply use MPI_Allreduce with a user-defined operation. The operation we give to MPI is basically Algorithm 2. It performs the operation (R_1, R_2) -> R, where [R_1; R_2] = Q R. This binary operation is associative, and this is all MPI needs to use a user-defined operation on a user-defined datatype. Moreover, if we change the signs of the elements of R so that the diagonal of R holds positive elements, then the binary operation Rfactor becomes commutative. The code becomes two lines: lapack_dgeqrf( mloc, n, A, lda, tau, &dlwork, lwork, &info ); MPI_Allreduce( MPI_IN_PLACE, A, 1, MPI_UPPER, LILA_MPIOP_QR_UPPER, mpi_comm); Julien Langou University of Colorado Denver Hierarchical QR 31 of 103
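A hedged sketch of what such a user-defined reduction could look like; this is not the authors' LILA code (MPI_UPPER and LILA_MPIOP_QR_UPPER above belong to their library and are not shown). Here RQR_N is an illustrative global for the column count, R factors travel as full n-by-n column-major blocks with zeros below the diagonal, only R is produced, and the sign fix-up that would make the operation commutative is omitted:

    #include <mpi.h>
    #include <lapacke.h>
    #include <stdlib.h>
    #include <string.h>

    static int RQR_N;  /* column count n, set before the reduction (illustrative) */

    /* inout := R factor of [ in ; inout ]  (MPI applies this in rank order,
       which only affects signs of R, not its validity).                     */
    static void combine_R(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        int n = RQR_N;
        double *Rin = in, *Rio = inout;
        double *stack = malloc((size_t)2 * n * n * sizeof(double));
        double *tau   = malloc((size_t)n * sizeof(double));
        (void)dt;
        for (int e = 0; e < *len; e++) {       /* len is 1 with the datatype below */
            for (int j = 0; j < n; j++) {      /* stack the two R factors          */
                memcpy(&stack[(size_t)j * 2 * n],     &Rin[(size_t)j * n], n * sizeof(double));
                memcpy(&stack[(size_t)j * 2 * n + n], &Rio[(size_t)j * n], n * sizeof(double));
            }
            LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * n, n, stack, 2 * n, tau);
            for (int j = 0; j < n; j++)        /* copy the new R back into inout   */
                for (int i = 0; i < n; i++)
                    Rio[i + (size_t)j * n] = (i <= j) ? stack[i + (size_t)j * 2 * n] : 0.0;
            Rin += (size_t)n * n;  Rio += (size_t)n * n;
        }
        free(stack); free(tau);
    }

    /* Combine the local n-by-n R factors across comm; every rank ends with R. */
    void allreduce_R(double *R, int n, MPI_Comm comm)
    {
        MPI_Datatype rtype;
        MPI_Op rop;
        RQR_N = n;
        MPI_Type_contiguous(n * n, MPI_DOUBLE, &rtype);
        MPI_Type_commit(&rtype);
        MPI_Op_create(combine_R, /* commute = */ 0, &rop);
        MPI_Allreduce(MPI_IN_PLACE, R, 1, rtype, rop, comm);
        MPI_Op_free(&rop);
        MPI_Type_free(&rtype);
    }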

32 Tall and Skinny matrices Minimizing communication in parallel distributed (29) [Figure: four reduction trees — binary tree, flat tree, a weird tree, another weird tree — each drawn with parallelism on one axis and time on the other.] Julien Langou University of Colorado Denver Hierarchical QR 32 of 103

33 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchical parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 33 of 103

34 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) [Table: inter-site latency (ms) and throughput (Mb/s) between the Orsay, Toulouse, Bordeaux, and Sophia clusters.] Julien Langou University of Colorado Denver Hierarchical QR 34 of 103

35 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Illustration of ScaLAPACK PDGEQR2 without reduce affinity. Cluster 1: Domain 1,1, Domain 1,2, Domain 1,3, Domain 1,4, Domain 1,5. Cluster 2: Domain 2,1, Domain 2,2, Domain 2,3, Domain 2,4. Cluster 3: Domain 3,1, Domain 3,2. Julien Langou University of Colorado Denver Hierarchical QR 35 of 103

36 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Illustration of ScaLAPACK PDGEQR2 with reduce affinity. Cluster 1: Domain 1,1, Domain 1,2, Domain 1,3, Domain 1,4, Domain 1,5. Cluster 2: Domain 2,1, Domain 2,2, Domain 2,3, Domain 2,4. Cluster 3: Domain 3,1, Domain 3,2. Julien Langou University of Colorado Denver Hierarchical QR 36 of 103

37 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Illustration of TSQR without reduce affinity. Cluster 1: Domain 1,1, Domain 1,2, Domain 1,3, Domain 1,4, Domain 1,5. Cluster 2: Domain 2,1, Domain 2,2, Domain 2,3, Domain 2,4. Cluster 3: Domain 3,1, Domain 3,2. Julien Langou University of Colorado Denver Hierarchical QR 37 of 103

38 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Illustration of TSQR with reduce affinity. Cluster 1: Domain 1,1, Domain 1,2, Domain 1,3, Domain 1,4, Domain 1,5. Cluster 2: Domain 2,1, Domain 2,2, Domain 2,3, Domain 2,4. Cluster 3: Domain 3,1, Domain 3,2. Julien Langou University of Colorado Denver Hierarchical QR 38 of 103

39 Tall and Skinny matrices Minimizing communication in hierarchical parallel distributed (6) Using 4 clusters of 32 processors. [Plots: performance (GFlop/s) of Classical Gram-Schmidt and of TSQR with a binary tree, for pg = 1, 2, or 4 clusters, pc = 32 nodes per cluster, n = 32, m varying on the x-axis; experiment and model curves.] Two effects at once: (1) avoiding communication with TSQR, (2) tuning of the reduction tree. Julien Langou University of Colorado Denver Hierarchical QR 39 of 103

40 Tall and Skinny matrices Minimizing communication in sequential (1) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchical parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 40 of 103

41 Tall and Skinny matrices Minimizing communication in sequential (1) Consider architecture: sequential case, one processing unit with cache of size W. Problem: QR factorization of an m-by-n TS matrix (TS: m >= n and W >= (3/2) n^2). Answer: flat tree. Theory (flat tree vs. LAPACK-like vs. lower bound): # flops: 2mn^2 vs. 2mn^2 vs. Theta(mn^2); # words: 2mn vs. m^2 n^2/(2W) vs. 2mn; # messages: 3mn/W vs. m n^2/(2W) vs. 2mn/W. Julien Langou University of Colorado Denver Hierarchical QR 41 of 103

42 Rectangular matrices Minimizing communication in sequential (2) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchical parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 42 of 103

43 Rectangular matrices Minimizing communication in sequential (2) [Plot: sequential case, LU (or QR) factorization of square matrices, with machine parameters M_fast = 10^6, M_slow = 10^9, and given beta and gamma; x-axis: n (matrix size); curves: problem lower bound, upper bound for CAQR flat tree, lower bound for the LAPACK algorithm; vertical markers at n = M_fast^(1/2) and n = M_slow^(1/2); the 90-percent-of-CPU-peak level is indicated.] Julien Langou University of Colorado Denver Hierarchical QR 43 of 103

44 Rectangular matrices Minimizing communication in sequential (2) [Plot: sequential case, LU (or QR) factorization of square matrices; y-axis: (# of slow memory references) / (problem lower bound); x-axis: n, size of the matrix; curves: lower bound for the LAPACK algorithm and upper bound for CAQR flat tree, for M_fast = 10^3, 10^6, ...] Julien Langou University of Colorado Denver Hierarchical QR 44 of 103

45 Rectangular matrices Maximizing parallelism on multicore nodes (30) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchical parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 45 of 103

46 Rectangular matrices Maximizing parallelism on multicore nodes (30) Mission Perform the QR factorization of an initial tiled matrix A on a multicore platform. time Julien Langou University of Colorado Denver Hierarchical QR 46 of 103

47 Rectangular matrices Maximizing parallelism on multicore nodes (30)
for (k = 0; k < TILES; k++) {
    dgeqrt(A[k][k], T[k][k]);                           /* QR of the diagonal tile (k,k)                */
    for (n = k+1; n < TILES; n++)
        dlarfb(A[k][k], T[k][k], A[k][n]);              /* apply it across the k-th tile row            */
    for (m = k+1; m < TILES; m++) {
        dtsqrt(A[k][k], A[m][k], T[m][k]);              /* zero tile (m,k) against the triangle (k,k)   */
        for (n = k+1; n < TILES; n++)
            dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]); /* apply that transformation to (k,n) and (m,n) */
    }
}
Julien Langou University of Colorado Denver Hierarchical QR 47 of 103

48 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(A[0][0], T[0][0]); (task dgeqrt 0; the loop nest of slide 47 is shown alongside each step) Julien Langou University of Colorado Denver Hierarchical QR 48 of 103

49 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(A[0][0], T[0][0]); (task dgeqrt 0; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 49 of 103

50 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(A[0][0], T[0][0]); 2. dlarfb(A[0][0], T[0][0], A[0][1]); (tasks dgeqrt 0, dlarfb 0,1; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 50 of 103

51 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(A[0][0], T[0][0]); 2. dlarfb(A[0][0], T[0][0], A[0][1]); (tasks dgeqrt 0, dlarfb 0,1; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 51 of 103

52 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(A[0][0], T[0][0]); 2. dlarfb(A[0][0], T[0][0], A[0][1]); 3. dlarfb(A[0][0], T[0][0], A[0][2]); 4. dlarfb(A[0][0], T[0][0], A[0][3]); (tasks dgeqrt 0, dlarfb 0,1, dlarfb 0,2, dlarfb 0,3; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 52 of 103

53 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(A[0][0], T[0][0]); 2. dlarfb(A[0][0], T[0][0], A[0][1]); 3. dlarfb(A[0][0], T[0][0], A[0][2]); 4. dlarfb(A[0][0], T[0][0], A[0][3]); (tasks dgeqrt 0, dlarfb 0,1, dlarfb 0,2, dlarfb 0,3; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 53 of 103

54 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(A[0][0], T[0][0]); 2. dlarfb(A[0][0], T[0][0], A[0][1]); 3. dlarfb(A[0][0], T[0][0], A[0][2]); 4. dlarfb(A[0][0], T[0][0], A[0][3]); 5. dtsqrt(A[0][0], A[1][0], T[1][0]); (tasks dgeqrt 0, dlarfb 0,1, dlarfb 0,2, dlarfb 0,3, dtsqrt 1,0; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 54 of 103

55 Rectangular matrices Maximizing parallelism on multicore nodes (30) 1. dgeqrt(A[0][0], T[0][0]); 2. dlarfb(A[0][0], T[0][0], A[0][1]); 3. dlarfb(A[0][0], T[0][0], A[0][2]); 4. dlarfb(A[0][0], T[0][0], A[0][3]); 5. dtsqrt(A[0][0], A[1][0], T[1][0]); (tasks dgeqrt 0, dlarfb 0,1, dlarfb 0,2, dlarfb 0,3, dtsqrt 1,0; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 55 of 103

56 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlarfb(A[0][0], T[0][0], A[0][3]); 5. dtsqrt(A[0][0], A[1][0], T[1][0]); 6. dssrfb(A[1][0], T[1][0], A[0][1], A[1][1]); (tasks ..., dtsqrt 1,0, dssrfb 1,1; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 56 of 103

57 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlarfb(A[0][0], T[0][0], A[0][3]); 5. dtsqrt(A[0][0], A[1][0], T[1][0]); 6. dssrfb(A[1][0], T[1][0], A[0][1], A[1][1]); (tasks ..., dtsqrt 1,0, dssrfb 1,1; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 57 of 103

58 Rectangular matrices Maximizing parallelism on multicore nodes (30) 4. dlarfb(A[0][0], T[0][0], A[0][3]); 5. dtsqrt(A[0][0], A[1][0], T[1][0]); 6. dssrfb(A[1][0], T[1][0], A[0][1], A[1][1]); 7. dssrfb(A[1][0], T[1][0], A[0][2], A[1][2]); 8. dssrfb(A[1][0], T[1][0], A[0][3], A[1][3]); (tasks ..., dssrfb 1,1, dssrfb 1,2, dssrfb 1,3; loop nest as on slide 47) Julien Langou University of Colorado Denver Hierarchical QR 58 of 103

59 Rectangular matrices Maximizing parallelism on multicore nodes (30) [Figure: DAG for a 5x5 tile matrix (QR factorization), with DGEQRT, DLARFB, DTSQRT, and DSSRFB task nodes.] Julien Langou University of Colorado Denver Hierarchical QR 59 of 103

60 Rectangular matrices Maximizing parallelism on multicore nodes (30) [Figure 9: DAG for a 20x20 tile matrix (QR factorization). Figure 10: DAG for an LU factorization with 20 tiles (block size 200 and matrix size 4000).] The size of the DAG grows very fast with the number of tiles. Julien Langou University of Colorado Denver Hierarchical QR 60 of 103

61 Rectangular matrices Maximizing parallelism on multicore nodes (30) [Plot: tile QR factorization on a 3.2 GHz CELL processor; Gflop/s vs. matrix size; curves for the SSSRFB peak and for tile QR.] Figure 5: Performance of the tile QR factorization in single precision on a 3.2 GHz CELL processor with eight SPEs. Square matrices were used. The solid horizontal line marks the performance of the SSSRFB kernel times the number of SPEs (22.16 x 8 = 177 Gflop/s). The presented implementation of tile QR factorization on the CELL processor allows for factorization of a 4000-by-4000 dense matrix in single precision in exactly half a second. To the author's knowledge, at present, it is the fastest reported time of solving such a problem by any semiconductor device implemented on a single semiconductor die. Jakub Kurzak and Jack Dongarra, LAWN 201, QR Factorization for the CELL Processor, May 2008. Julien Langou University of Colorado Denver Hierarchical QR 61 of 103

62 Rectangular matrices Maximizing parallelism on multicore nodes (30) Perform the QR factorization of an initial tiled matrix A. Our tool: Givens rotations: ( cos(theta) sin(theta) ; -sin(theta) cos(theta) ). Julien Langou University of Colorado Denver Hierarchical QR 62 of 103
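As a short reminder of how one such rotation introduces a zero (the standard Givens construction, not specific to these slides): to eliminate b against a, choose

\[
r=\sqrt{a^2+b^2},\qquad \cos\theta=\frac{a}{r},\qquad \sin\theta=\frac{b}{r},
\qquad\text{so that}\qquad
\begin{pmatrix}\cos\theta & \sin\theta\\ -\sin\theta & \cos\theta\end{pmatrix}
\begin{pmatrix}a\\ b\end{pmatrix}
=\begin{pmatrix}r\\ 0\end{pmatrix}.
\]

In the tiled algorithms that follow, the same elimination pattern is applied at the level of b-by-b tiles (via the TS/TT kernels) rather than to scalar entries.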

63 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 binomial tree 4 First column: The goal is to introduce zeros using Givens rotations. Only the top entry shall stay. At each step, this eliminates this. (This is a reduction.) time The first step requires seven computing units, the second four, the third two, and the fourth one. Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

64 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 binomial tree 4 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

65 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

66 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 8 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

67 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 binomial tree 4 8 flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

68 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

69 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

70 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree 14 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

71 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

72 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

73 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 time Note: one point of the plasma tree is to (1) enhance parallelism in the rectangular case, (2) increase data locality (when working within a domain). Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

74 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

75 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) 5 8 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

76 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

77 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree 4 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

78 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree 4 6 time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

79 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

80 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

81 Rectangular matrices Maximizing parallelism on multicore nodes (30) (p = 15) q = 1 q = 2 q = 3 q = 4 binomial tree flat tree plasma tree (bs=4) greedy tree time Flat tree - Sameh and Kuck (1978). Plasma tree - Hadri et al. (2010). Note that the binomial tree corresponds to bs = 1, and the flat tree to bs = p. Greedy - Modi and Clarke (1984), Cosnard and Robert (1986). Cosnard and Robert (1986) proved that, no matter the shape of the matrix (i.e., p and q), Greedy was optimal. Julien Langou University of Colorado Denver Hierarchical QR 63 of 103

82 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 64 of 103

83 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 65 of 103

84 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 66 of 103

85 Rectangular matrices Maximizing parallelism on multicore nodes (30) time Julien Langou University of Colorado Denver Hierarchical QR 67 of 103

86 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (costs in units of b^3/3 flops):
  Factor square into triangle: panel GEQRT (cost 4), update UNMQR (cost 6)
  Zero square with triangle on top: panel TSQRT (cost 6), update TSMQR (cost 12)
  Zero triangle with triangle on top: panel TTQRT (cost 2), update TTMQR (cost 6)
Algorithm 1: Elimination elim(i, piv(i,k), k) via TS kernels (triangle on top of square):
  GEQRT(piv(i,k), k); for j = k+1 to q do UNMQR(piv(i,k), k, j);
  TSQRT(i, piv(i,k), k); for j = k+1 to q do TSMQR(i, piv(i,k), k, j).
Algorithm 2: Elimination elim(i, piv(i,k), k) via TT kernels (triangle on top of triangle):
  GEQRT(piv(i,k), k); GEQRT(i, k); for j = k+1 to q do UNMQR(piv(i,k), k, j); UNMQR(i, k, j);
  TTQRT(i, piv(i,k), k); for j = k+1 to q do TTMQR(i, piv(i,k), k, j).
Julien Langou University of Colorado Denver Hierarchical QR 68 of 103
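A compact sketch of how such an elimination list drives the whole factorization. This is an illustration under stated assumptions, not the authors' code: GEQRT, UNMQR, TTQRT, TTMQR are placeholder prototypes for whatever tile kernels are used (real PLASMA-style kernels take tile and T pointers plus sizes), and the list encodes the chosen tree (flat, binomial, Fibonacci, greedy, ...):

    #include <stdlib.h>

    /* Placeholder prototypes for the tile kernels of the table above. */
    void GEQRT(int i, int k);
    void UNMQR(int i, int k, int j);
    void TTQRT(int i, int piv, int k);
    void TTMQR(int i, int piv, int k, int j);

    typedef struct { int i, piv, k; } elim_t;

    /* Flat-tree elimination list: at each step k, rows k+1..p-1 are eliminated
       against row k, top to bottom. Other trees only change this list.        */
    int flat_tree(int p, int q, elim_t *list)
    {
        int n = 0;
        for (int k = 0; k < q; k++)
            for (int i = k + 1; i < p; i++)
                list[n++] = (elim_t){ i, k, k };
        return n;
    }

    /* Tiled QR on a p-by-q tile matrix via TT kernels (Algorithm 2), driven by
       an elimination list. GEQRT/UNMQR are skipped for tiles that are already
       in triangular form, as noted on slide 99.                               */
    void tiled_qr_tt(int p, int q, const elim_t *list, int nelim)
    {
        char *tri = calloc((size_t)p * q, 1);   /* tri[i + k*p]: tile (i,k) triangular? */
        for (int e = 0; e < nelim; e++) {
            int i = list[e].i, pv = list[e].piv, k = list[e].k;
            if (!tri[pv + k * p]) {
                GEQRT(pv, k);
                for (int j = k + 1; j < q; j++) UNMQR(pv, k, j);
                tri[pv + k * p] = 1;
            }
            if (!tri[i + k * p]) {
                GEQRT(i, k);
                for (int j = k + 1; j < q; j++) UNMQR(i, k, j);
                tri[i + k * p] = 1;
            }
            TTQRT(i, pv, k);                    /* zero triangle (i,k) with triangle (pv,k) */
            for (int j = k + 1; j < q; j++) TTMQR(i, pv, k, j);
        }
        free(tri);
    }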

87 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 1, elimination via TS kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

88 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 1, elimination via TS kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

89 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 1, elimination via TS kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

90 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 1, elimination via TS kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

91 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 1, elimination via TS kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

92 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 1, elimination via TS kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

93 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 2, elimination via TT kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

94 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 2, elimination via TT kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

95 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 2, elimination via TT kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

96 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 2, elimination via TT kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

97 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 2, elimination via TT kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

98 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 2, elimination via TT kernels, as on slide 86; animation frame). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

99 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table and Algorithm 2, elimination via TT kernels, as on slide 86). Note: it is understood that if a tile is already in triangle form, then the associated GEQRT and update kernels don't need to be applied. Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

100 Rectangular matrices Maximizing parallelism on multicore nodes (30) Kernels for orthogonal transformations (kernel cost table with Algorithm 1, elimination via TS kernels, and Algorithm 2, elimination via TT kernels, shown side by side, as on slide 86). Julien Langou University of Colorado Denver Hierarchical QR 68 of 103

101 Rectangular matrices Maximizing parallelism on multicore nodes (30) The analysis and the coding are a little more complex... Julien Langou University of Colorado Denver Hierarchical QR 69 of 103

102 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: No matter what elimination list (any combination of TT, TS) is used, the total weight of the tasks for performing a tiled QR factorization is constant and equal to 6pq^2 - 2q^3. Proof:
L1 :: GEQRT = TTQRT + q
L2 :: UNMQR = TTMQR + (1/2)q(q-1)
L3 :: TTQRT + TSQRT = pq - (1/2)q(q+1)
L4 :: TTMQR + TSMQR = (1/2)pq(q-1) - (1/6)q(q-1)(q+1)
Define L5 as 4L1 + 6L2 + 6L3 + 12L4 and we get
L5 :: 6pq^2 - 2q^3 = 4 GEQRT + 12 TSMQR + 6 TTMQR + 2 TTQRT + 6 TSQRT + 6 UNMQR = total # of flops.
Note: using our unit task weight of b^3/3, with m = pb and n = qb, we obtain 2mn^2 - (2/3)n^3 flops, which is the exact same number as for a standard Householder reflection algorithm as found in LAPACK or ScaLAPACK. Julien Langou University of Colorado Denver Hierarchical QR 70 of 103
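A quick check of the formula on a 2-by-2 tile matrix (p = q = 2), using the flat-tree TS variant: the task list is GEQRT(0,0), TSMQR-update of the first column via UNMQR(0,0,1), TSQRT(1,0), TSMQR(1,0,1), GEQRT(1,1), with weights 4 + 6 + 6 + 12 + 4 = 32; the TT variant gives 4 + 4 + 6 + 6 + 2 + 6 + 4 = 32 as well; and indeed 6pq^2 - 2q^3 = 6*2*4 - 2*8 = 32.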

103 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: For a tiled matrix of size p x q, where p >= q, the critical path length of FLATTREE is: 2p + 2 if p >= q = 1; 6p + 16q - 22 if p > q > 1; 22p - 24 if p = q > 1. [Figure: pipelined elimination of the 2nd, 3rd, and 4th rows and of the 2nd and 3rd columns over time; contributions: initial 10, fill the pipeline 6(p-1), pipeline 16(q-2), end 4.] Julien Langou University of Colorado Denver Hierarchical QR 71 of 103
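For example, plugging a 40-by-10 tile matrix into the formulas above (p = 40 > q = 10 > 1): the FLATTREE critical path is 6*40 + 16*10 - 22 = 378 time units, while the total weight is 6*40*100 - 2*1000 = 22000, so the average parallelism available to this schedule is at most 22000 / 378, roughly 58.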

104 Rectangular matrices Maximizing parallelism on multicore nodes (30) Greedy tree on a 15x4 matrix. (Weighted on top, coarse below.) time time Julien Langou University of Colorado Denver Hierarchical QR 72 of 103

105 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem: For a tiled matrix of size p x q, where p >= q, given the time of any coarse-grain algorithm, COARSE(p, q), the corresponding weighted algorithm time WEIGHTED(p, q) satisfies 10(q-1) + 6 COARSE(p, q-1) <= WEIGHTED(p, q) <= 10(q-1) + 6 COARSE(p, q-1) + 4 + 2 (COARSE(p, q) - COARSE(p, q-1)). Julien Langou University of Colorado Denver Hierarchical QR 73 of 103

106 Rectangular matrices Maximizing parallelism on multicore nodes (30) Theorem We can prove that the GRASAP algorithm is optimal. Julien Langou University of Colorado Denver Hierarchical QR 74 of 103

107 Rectangular matrices Maximizing parallelism on multicore nodes (30) Platform: All experiments were performed on a 48-core machine composed of eight hexa-core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz. Each core has a theoretical peak of 11.2 Gflop/s, i.e. a peak of 537.6 Gflop/s for the whole machine. The experimental code is written using the QUARK scheduler. Results are checked with ||I - Q^T Q|| and ||A - QR|| / ||A||. The code has been written in real and complex arithmetic, single and double precision. Julien Langou University of Colorado Denver Hierarchical QR 75 of 103
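A minimal sketch of those two accuracy checks with CBLAS/LAPACKE, assuming the explicit m-by-n Q, the n-by-n R, and a copy of the original A are available in column-major storage with leading dimension m (Frobenius norms are used here, which is one common choice):

    #include <cblas.h>
    #include <lapacke.h>
    #include <stdlib.h>

    /* Orthogonality: ||I - Q^T Q||_F ;  residual: ||A - Q R||_F / ||A||_F */
    void qr_checks(int m, int n, const double *A, const double *Q, const double *R,
                   double *orth, double *resid)
    {
        double *QtQ = malloc((size_t)n * n * sizeof(double));
        double *QR  = malloc((size_t)m * n * sizeof(double));

        /* QtQ := -Q^T Q, then add the identity, so QtQ holds I - Q^T Q. */
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, n, n, m,
                    -1.0, Q, m, Q, m, 0.0, QtQ, n);
        for (int i = 0; i < n; i++) QtQ[i + (size_t)i * n] += 1.0;
        *orth = LAPACKE_dlange(LAPACK_COL_MAJOR, 'F', n, n, QtQ, n);

        /* QR := Q*R, then subtract A and take the relative norm. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, n,
                    1.0, Q, m, R, n, 0.0, QR, m);
        for (size_t k = 0; k < (size_t)m * n; k++) QR[k] -= A[k];
        *resid = LAPACKE_dlange(LAPACK_COL_MAJOR, 'F', m, n, QR, m)
               / LAPACKE_dlange(LAPACK_COL_MAJOR, 'F', m, n, A, m);

        free(QtQ); free(QR);
    }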

108 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex arithmetic (p = 40, b = 200, model). [Plot: overhead in critical path length with respect to GREEDY (GREEDY = 1) as a function of q, for FLATTREE (TT), PLASMATREE (TT) with the best domain size, FIBONACCI (TT), and GREEDY.] Julien Langou University of Colorado Denver Hierarchical QR 76 of 103

109 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex arithmetic (p = 40, b = 200, experimental). [Plot: overhead in time with respect to GREEDY (GREEDY = 1) as a function of q, for the same algorithms.] Julien Langou University of Colorado Denver Hierarchical QR 77 of 103

110 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex arithmetic (p = 40, b = 200, model on left, experimental on right). [Plots: overhead in critical path length (left) and in time (right) with respect to GREEDY, as a function of q, for the same algorithms.] The main difference is on the right, where there is enough parallelism so that all methods give peak performance (peak is TTMQR performance). Julien Langou University of Colorado Denver Hierarchical QR 78 of 103

111 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex arithmetic (p = 40, b = 200). [Plot: GFLOP/s as a function of q for FLATTREE (TT), PLASMATREE (TT) with the best domain size, FIBONACCI (TT), and GREEDY.] Julien Langou University of Colorado Denver Hierarchical QR 79 of 103

112 Rectangular matrices Maximizing parallelism on multicore nodes (30) Complex arithmetic (p = 40, b = 200, model on left, experimental on right). [Plots: predicted GFLOP/s (left) and measured GFLOP/s (right) as a function of q, for the same algorithms.] Julien Langou University of Colorado Denver Hierarchical QR 80 of 103

113 Rectangular matrices Parallelism+communication on multicore nodes (2) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchical parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 81 of 103

114 Rectangular matrices Parallelism+communication on multicore nodes (2) Factor kernels: GEQRT (# flops 4), TTQRT (2), TSQRT (6). We work on square b-by-b tiles. The unit for the task weights is b^3/3. Each of our tasks is O(b^3) computation for O(b^2) communication. Thanks to this surface/volume effect, we can, in a first approximation, neglect communication and focus only on the parallelism. Julien Langou University of Colorado Denver Hierarchical QR 82 of 103
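To make the surface/volume effect concrete with the tile size used in the experiments (b = 200): a GEQRT task costs 4 b^3/3, about 1.07e7 flops, while it touches on the order of b^2 = 4e4 matrix entries, i.e. roughly 4b/3 = 267 flops per entry moved, which is why communication can be neglected in this first-order analysis.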

115 Rectangular matrices Parallelism+communication on multicore nodes (2) Using TT kernels: GEQRT, GEQRT, TTQRT. Total # of flops (sequential time): 2 * GEQRT + TTQRT = 2 * 4 + 2 = 10. Critical path length (parallel time): GEQRT + TTQRT = 4 + 2 = 6. Using TS kernels: GEQRT, TSQRT. Total # of flops (sequential time): GEQRT + TSQRT = 4 + 6 = 10. Critical path length (parallel time): GEQRT + TSQRT = 4 + 6 = 10. One remarkable outcome is that the total number of flops is 10 b^3/3. If we consider the number of flops for a standard LAPACK algorithm (based on standard Householder transformations), we would also obtain 10 b^3/3. Julien Langou University of Colorado Denver Hierarchical QR 83 of 103

116 Rectangular matrices Parallelism+communication on multicore nodes (2) [Plot: GFLOP/s vs. tile size for the TSQRT, GEQRT, TTQRT, GEQRT+TTQRT, and GEMM kernels, each in an (in) and (out) variant.] The TS kernels (factorization and update) are more efficient than the TT ones due to a variety of reasons (better vectorization, better volume-to-surface ratio). Julien Langou University of Colorado Denver Hierarchical QR 84 of 103

117 Rectangular matrices Parallelism+communication on multicore nodes (2) Real arithmetic (p = 40, b = 200, model on left, experimental on right). [Plots: predicted and measured GFLOP/s as a function of q for FLATTREE (TS), PLASMATREE (TS) with the best domain size, FLATTREE (TT), PLASMATREE (TT) with the best domain size, FIBONACCI (TT), and GREEDY.] This motivates the introduction of a TS level: square 1-by-1 tiles are grouped into larger rectangular tiles of size a-by-1, which are eliminated in a TS fashion; the regular TT algorithm is then applied on the rectangular tiles. Pros: recover the performance of the TS level. Cons: introduce a tuning parameter (a). Julien Langou University of Colorado Denver Hierarchical QR 85 of 103

118 Rectangular matrices Parallelism+communication on multicore nodes (2) Real arithmetic (p = 40, b = 200, performance on left, overhead on right). [Plots: Gflop/s and overhead with respect to BEST as a function of q, for BEST, FT (TS), BINOMIAL (TT), FT (TT), FIBO (TT), Greedy (TT), Greedy+RRon+a=8, Greedy+RRoff+a=8; P = 40, MB = 200, IB = 32, 48 cores.] Idea: use TS domains of size a, use Greedy on top of them, and use round-robin on the domains. (This experiment was done with DAGuE.) Julien Langou University of Colorado Denver Hierarchical QR 86 of 103

119 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Outline Tall and Skinny matrices Minimizing communication in parallel distributed (29) Minimizing communication in hierarchical parallel distributed (6) Minimizing communication in sequential (1) Rectangular matrices Minimizing communication in sequential (2) Maximizing parallelism on multicore nodes (30) Parallelism+communication on multicore nodes (2) Parallelism+communication on distributed+multicore nodes (13) Scheduling on multicore nodes (4) Julien Langou University of Colorado Denver Hierarchical QR 87 of 103

120 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Outline. Three goals: 1. minimize the number of local communications; 2. minimize the number of parallel distributed communications; 3. maximize the parallelism within a node. Julien Langou University of Colorado Denver Hierarchical QR 88 of 103

121 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103

122 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103

123 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical trees Julien Langou University of Colorado Denver Hierarchical QR 89 of 103

124 Rectangular matrices Parallelism+communication on distributed+multicore nodes (13) Hierarchical algorithm layout Legend P 0 P 1 P 2 0 local TS 1 local TT 2 domino 3 global tree Julien Langou University of Colorado Denver Hierarchical QR 90 of 103


Communication-avoiding LU and QR factorizations for multicore architectures Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012 MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

Tiled QR factorization algorithms

Tiled QR factorization algorithms INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE Tiled QR factorization algorithms Henricus Bouwmeester Mathias Jacuelin Julien Langou Yves Robert N 7601 Avril 2011 Distributed and High

More information

Lightweight Superscalar Task Execution in Distributed Memory

Lightweight Superscalar Task Execution in Distributed Memory Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,

More information

Tiled QR factorization algorithms

Tiled QR factorization algorithms Tiled QR factorization algorithms Henricus Bouwmeester, Mathias Jacuelin, Julien Langou, Yves Robert To cite this version: Henricus Bouwmeester, Mathias Jacuelin, Julien Langou, Yves Robert. Tiled QR factorization

More information

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially

More information

Minimizing Communication in Linear Algebra. James Demmel 15 June

Minimizing Communication in Linear Algebra. James Demmel 15 June Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on

More information

Matrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing)

Matrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing) Matrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing) alfredo.buttari@enseeiht.fr for an up-to-date version of the slides: http://buttari.perso.enseeiht.fr Introduction Objective

More information

LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou Dpt of Electrical Engineering

More information

Communication avoiding parallel algorithms for dense matrix factorizations

Communication avoiding parallel algorithms for dense matrix factorizations Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013

More information

A communication-avoiding thick-restart Lanczos method on a distributed-memory system

A communication-avoiding thick-restart Lanczos method on a distributed-memory system A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart

More information

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June

More information

Avoiding Communication in Distributed-Memory Tridiagonalization

Avoiding Communication in Distributed-Memory Tridiagonalization Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009

Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009 III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009 Background and motivation Running time of an algorithm

More information

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois

More information

Overview of Dense Numerical Linear Algebra Libraries

Overview of Dense Numerical Linear Algebra Libraries Overview of Dense Numerical Linear Algebra Libraries BLAS: kernel for dense linear algebra LAPACK: sequen;al dense linear algebra ScaLAPACK: parallel distributed dense linear algebra And Beyond Linear

More information

A distributed packed storage for large dense parallel in-core calculations

A distributed packed storage for large dense parallel in-core calculations CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A

More information

Introduction to communication avoiding linear algebra algorithms in high performance computing

Introduction to communication avoiding linear algebra algorithms in high performance computing Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Eigenvalue Problems Section 5.1 Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael

More information

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013

More information

Performance optimization of WEST and Qbox on Intel Knights Landing

Performance optimization of WEST and Qbox on Intel Knights Landing Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University

More information

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February

More information

Enhancing Performance of Tall-Skinny QR Factorization using FPGAs

Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Abid Rafique Imperial College London August 31, 212 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18 Our Claim Common

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

2.5D algorithms for distributed-memory computing

2.5D algorithms for distributed-memory computing ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou To cite this version: Emmanuel Agullo, Camille Coti,

More information

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): ) ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS Abid Rafique, Nachiket Kapre and George A. Constantinides Electrical and Electronic Engineering Department Imperial College London London,

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

CALU: A Communication Optimal LU Factorization Algorithm

CALU: A Communication Optimal LU Factorization Algorithm CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9

More information

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank

More information

Parallelizing Gaussian Process Calcula1ons in R

Parallelizing Gaussian Process Calcula1ons in R Parallelizing Gaussian Process Calcula1ons in R Christopher Paciorek UC Berkeley Sta1s1cs Joint work with: Benjamin Lipshitz Wei Zhuo Prabhat Cari Kaufman Rollin Thomas UC Berkeley EECS (formerly) IBM

More information

Porting a Sphere Optimization Program from lapack to scalapack

Porting a Sphere Optimization Program from lapack to scalapack Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential

More information

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems Noname manuscript No. (will be inserted by the editor) Power Profiling of Cholesky and QR Factorizations on Distributed s George Bosilca Hatem Ltaief Jack Dongarra Received: date / Accepted: date Abstract

More information

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March

More information

A short note on the Householder QR factorization

A short note on the Householder QR factorization A short note on the Householder QR factorization Alfredo Buttari October 25, 2016 1 Uses of the QR factorization This document focuses on the QR factorization of a dense matrix. This method decomposes

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Sparse BLAS-3 Reduction

Sparse BLAS-3 Reduction Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc

More information

ab initio Electronic Structure Calculations

ab initio Electronic Structure Calculations ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris

More information

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse

More information

The Future of LAPACK and ScaLAPACK

The Future of LAPACK and ScaLAPACK The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and

More information

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and

More information

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value

More information

Communication-avoiding Krylov subspace methods

Communication-avoiding Krylov subspace methods Motivation Communication-avoiding Krylov subspace methods Mark mhoemmen@cs.berkeley.edu University of California Berkeley EECS MS Numerical Libraries Group visit: 28 April 2008 Overview Motivation Current

More information

Porting a sphere optimization program from LAPACK to ScaLAPACK

Porting a sphere optimization program from LAPACK to ScaLAPACK Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference

More information

Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179

Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179 Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179 Marc Baboulin Luc Giraud Serge Gratton Julien Langou September 19, 2006 Abstract

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest

More information

Scalable numerical algorithms for electronic structure calculations

Scalable numerical algorithms for electronic structure calculations Scalable numerical algorithms for electronic structure calculations Edgar Solomonik C Berkeley July, 2012 Edgar Solomonik Cyclops Tensor Framework 1/ 73 Outline Introduction Motivation: Coupled Cluster

More information

Performance Evaluation of Some Inverse Iteration Algorithms on PowerXCell T M 8i Processor

Performance Evaluation of Some Inverse Iteration Algorithms on PowerXCell T M 8i Processor Performance Evaluation of Some Inverse Iteration Algorithms on PowerXCell T M 8i Processor Masami Takata 1, Hiroyuki Ishigami 2, Kini Kimura 2, and Yoshimasa Nakamura 2 1 Academic Group of Information

More information

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria 1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation

More information

A Parallel Implementation of the Davidson Method for Generalized Eigenproblems

A Parallel Implementation of the Davidson Method for Generalized Eigenproblems A Parallel Implementation of the Davidson Method for Generalized Eigenproblems Eloy ROMERO 1 and Jose E. ROMAN Instituto ITACA, Universidad Politécnica de Valencia, Spain Abstract. We present a parallel

More information

Reducing the Amount of Pivoting in Symmetric Indefinite Systems

Reducing the Amount of Pivoting in Symmetric Indefinite Systems Reducing the Amount of Pivoting in Symmetric Indefinite Systems Dulceneia Becker 1, Marc Baboulin 4, and Jack Dongarra 1,2,3 1 University of Tennessee, USA [dbecker7,dongarra]@eecs.utk.edu 2 Oak Ridge

More information

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized

More information

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version 1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization

Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan 2, Robert A. van de Geijn 2, and Field

More information

Parallel Numerical Linear Algebra

Parallel Numerical Linear Algebra Parallel Numerical Linear Algebra 1 QR Factorization solving overconstrained linear systems tiled QR factorization using the PLASMA software library 2 Conjugate Gradient and Krylov Subspace Methods linear

More information

Avoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University

Avoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University Avoiding communication in linear algebra Laura Grigori INRIA Saclay Paris Sud University Motivation Algorithms spend their time in doing useful computations (flops) or in moving data between different

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,

More information

Some notes on efficient computing and setting up high performance computing environments

Some notes on efficient computing and setting up high performance computing environments Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Iterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay

Iterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay Iterative methods, preconditioning, and their application to CMB data analysis Laura Grigori INRIA Saclay Plan Motivation Communication avoiding for numerical linear algebra Novel algorithms that minimize

More information

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint: Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University

More information

TOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM

TOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM TOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM AZZAM HAIDAR, HATEM LTAIEF, AND JACK DONGARRA Abstract. Classical solvers for the dense symmetric eigenvalue

More information

Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures

Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures Mark Gove? NOAA Earth System Research Laboratory We Need Be?er Numerical Weather Predic(on Superstorm Sandy Hurricane

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

Preconditioned Eigensolver LOBPCG in hypre and PETSc

Preconditioned Eigensolver LOBPCG in hypre and PETSc Preconditioned Eigensolver LOBPCG in hypre and PETSc Ilya Lashuk, Merico Argentati, Evgueni Ovtchinnikov, and Andrew Knyazev Department of Mathematics, University of Colorado at Denver, P.O. Box 173364,

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

Practicality of Large Scale Fast Matrix Multiplication

Practicality of Large Scale Fast Matrix Multiplication Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by

More information

Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions

Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik 1, Devin Matthews 3, Jeff Hammond 4, James Demmel 1,2 1 Department of

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 37, No. 3, pp. C307 C330 c 2015 Society for Industrial and Applied Mathematics MIXED-PRECISION CHOLESKY QR FACTORIZATION AND ITS CASE STUDIES ON MULTICORE CPU WITH MULTIPLE GPUS

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

Scalable Non-blocking Preconditioned Conjugate Gradient Methods

Scalable Non-blocking Preconditioned Conjugate Gradient Methods Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William

More information

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006.

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006. LAPACK-Style Codes for Pivoted Cholesky and QR Updating Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig 2007 MIMS EPrint: 2006.385 Manchester Institute for Mathematical Sciences School of Mathematics

More information

On aggressive early deflation in parallel variants of the QR algorithm

On aggressive early deflation in parallel variants of the QR algorithm On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden

More information

MUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux

MUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent

More information

INITIAL INTEGRATION AND EVALUATION

INITIAL INTEGRATION AND EVALUATION INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information