QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment


1 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment. Emmanuel Agullo (INRIA / LaBRI), Camille Coti (Iowa State University), Jack Dongarra (University of Tennessee), Thomas Hérault (U. Paris Sud / U. of Tennessee / LRI / INRIA), Julien Langou (University of Colorado Denver). IPDPS, Atlanta, USA, April 19-23, 2010.

2 Introduction: Question. Can we speed up dense linear algebra applications using a computational grid?

3 Introduction: Building blocks. Tremendous computational power of grid infrastructures: BOINC (2.4 Pflop/s, 7.9 Pflop/s). MPI-based linear algebra libraries: ScaLAPACK; HP Linpack. Grid-enabled MPI middleware: MPICH-G2; PACX-MPI; GridMPI.

4 Introduction: Past answers. Can we speed up dense linear algebra applications using a computational grid? GrADS project [Petitet et al., 2001]: the grid makes it possible to process larger matrices; but for matrices that fit in the (distributed) memory of a cluster, using a single cluster is optimal. Study on a cloud infrastructure [Napper et al., 2009], running Linpack on the Amazon EC2 commercial offer: under-calibrated components; the grid costs too much.

5 Introduction: Our approach. Principle: confine intensive communications (ScaLAPACK calls) within the different geographical sites. Method: articulate Communication-Avoiding algorithms [Demmel et al., 2008] with a topology-aware middleware (QCG-OMPI). Focus: QR factorization of Tall and Skinny matrices.

6 Outline. 1. Background. 2. Articulation of TSQR with QCG-OMPI. 3. Experiments: ScaLAPACK performance; TSQR performance; TSQR vs ScaLAPACK performance. 4. Conclusion and future work.


8 Background: TSQR / CAQR. Communication-Avoiding QR (CAQR) [Demmel et al., 2008] and Tall and Skinny QR (TSQR). (Figure: CAQR computes R through TSQR panel factorizations followed by updates.) Examples of applications for TSQR: panel factorization in CAQR; block iterative methods (iterative methods with multiple right-hand sides, or iterative eigenvalue solvers); linear least squares problems with far more equations than unknowns.
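The TSQR reduction tree described above can be sketched in a few lines of NumPy. This is an illustrative single-process simulation (split into row blocks, local QR per block, pairwise combination of stacked R factors), not the distributed MPI implementation used in the talk:

```python
import numpy as np

def tsqr_r(A, num_domains):
    """Compute the R factor of a tall-and-skinny A with a binary
    reduction tree: local QR per domain, then pairwise combination
    of stacked R factors until a single R remains."""
    blocks = np.array_split(A, num_domains, axis=0)
    # Leaf level: independent local factorizations (one per domain).
    rs = [np.linalg.qr(b, mode='r') for b in blocks]
    # Tree reduction: each round halves the number of R factors.
    while len(rs) > 1:
        nxt = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode='r')
               for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2 == 1:          # an odd leftover passes through
            nxt.append(rs[-1])
        rs = nxt
    return rs[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 64))   # tall and skinny: M >> N
R_tree = tsqr_r(A, num_domains=8)
R_flat = np.linalg.qr(A, mode='r')
# R is unique up to the signs of its rows, so compare magnitudes.
assert np.allclose(np.abs(R_tree), np.abs(R_flat))
```

Each tree level only exchanges small N-by-N triangles, which is why the inter-cluster traffic stays low when the leaves are mapped to clusters.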


12 Background: QCG-OMPI, topology-aware MPI middleware for the grid. MPICH-G2: describes the topology through the concept of colors, used to build topology-aware MPI communicators; the application has to adapt itself to the discovered topology; based on MPICH. QCG-OMPI: resource-aware grid meta-scheduler (QosCosGrid); allocates resources that match requirements expressed in a JobProfile (amount of memory, CPU speed, network properties between groups of processes, ...); the application is always executed on an appropriate resource topology; based on Open MPI.
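The "colors" mechanism can be illustrated with a toy, MPI-free sketch. The helper below is hypothetical (it is not the MPICH-G2 or QCG-OMPI API); it only mirrors what MPI_Comm_split does: processes sharing a color land in the same intra-cluster group, and one leader per group can then form an inter-cluster group:

```python
from collections import defaultdict

def split_by_color(rank_to_color):
    """Group ranks by color, as MPI_Comm_split would group processes
    into one communicator per distinct color value."""
    groups = defaultdict(list)
    for rank in sorted(rank_to_color):
        groups[rank_to_color[rank]].append(rank)
    return dict(groups)

# Eight processes spread over three clusters (color = cluster id).
colors = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2}
intra = split_by_color(colors)
leaders = [members[0] for members in intra.values()]

print(intra)    # {0: [0, 1, 2], 1: [3, 4], 2: [5, 6, 7]}
print(leaders)  # [0, 3, 5] -- one process per cluster for inter-cluster steps
```

Intra-cluster communicators carry the bandwidth-heavy ScaLAPACK traffic, while only the leaders talk across clusters.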



16 Articulation of TSQR with QCG-OMPI: Communication pattern (M-by-3 matrix). ScaLAPACK (panel factorization routine), non-optimized tree. (Figure: illustration of ScaLAPACK PDGEQRF without reduce affinity; Cluster 1 holds Domains 1,1 to 1,5; Cluster 2 holds Domains 2,1 to 2,4; Cluster 3 holds Domains 3,1 and 3,2.) 25 inter-cluster communications.


18 Articulation of TSQR with QCG-OMPI: Communication pattern (M-by-3 matrix). ScaLAPACK (panel factorization routine), optimized tree. (Figure: illustration of ScaLAPACK PDGEQRF with reduce affinity, over the same three clusters.) 10 inter-cluster communications.

19 Articulation of TSQR with QCG-OMPI: Communication pattern (M-by-3 matrix). TSQR, optimized tree. 2 inter-cluster communications.


21 Articulation of TSQR with QCG-OMPI: Communication pattern (M-by-3 matrix). (Figure: side-by-side comparison of the ScaLAPACK and TSQR communication patterns.)

22 Articulation of TSQR with QCG-OMPI. ScaLAPACK-based TSQR: each domain is factorized with a ScaLAPACK call; the reduction is performed by pairs of communicators; the number of domains per cluster may vary from 1 to 64 (the number of cores per cluster). QCG JobProfile: processes are split into groups of equivalent computing power; good network connectivity inside each group (low latency, high bandwidth); lower network connectivity between the groups. This is a classical cluster-of-clusters approach (with a constraint on the relative size of the clusters to facilitate load balancing).
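The constraints listed above (equal-power groups, fast intra-group links, slower inter-group links) can be written down concretely. The dictionary below is a hypothetical, simplified stand-in for a QCG JobProfile; the real schema differs, and all numeric thresholds here are invented for illustration:

```python
# Hypothetical stand-in for a QCG JobProfile (not the actual schema);
# the latency/bandwidth thresholds are illustrative assumptions only.
job_profile = {
    "process_groups": [
        {"name": f"cluster{i}", "processes": 32, "min_cpu_ghz": 2.0}
        for i in range(4)
    ],
    "intra_group_network": {"max_latency_ms": 0.1, "min_bandwidth_mbps": 1000},
    "inter_group_network": {"max_latency_ms": 20.0, "min_bandwidth_mbps": 100},
}

def balanced(profile):
    """Check the load-balancing constraint: all groups the same size."""
    sizes = {g["processes"] for g in profile["process_groups"]}
    return len(sizes) == 1

assert balanced(job_profile)
```

Expressing the requirements this way lets the meta-scheduler pick any set of clusters that satisfies them, so the application never has to probe the topology itself.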

23 Articulation of TSQR with QCG-OMPI: Communication and computation breakdown (critical path).

Computing R:
                # inter-cluster msg | inter-cluster vol. exchanged | # FLOPs
ScaLAPACK QR2   2N log2(C)          | log2(C) (N^2/2)              | (2MN^2 - (2/3)N^3)/C
TSQR            log2(C)             | log2(C) (N^2/2)              | (2MN^2 - (2/3)N^3)/C + (2/3) log2(C) N^3

Computing Q and R (on C clusters):
                # inter-cluster msg | inter-cluster vol. exchanged | # FLOPs
ScaLAPACK QR2   4N log2(C)          | 2 log2(C) (N^2/2)            | (4MN^2 - (4/3)N^3)/C
TSQR            2 log2(C)           | 2 log2(C) (N^2/2)            | (4MN^2 - (4/3)N^3)/C + (4/3) log2(C) N^3

C: number of clusters; one domain per cluster is assumed for these tables.
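These critical-path counts can be checked numerically. The sketch below transcribes the R-factor formulas (one domain per cluster, C a power of two) and confirms that TSQR sends 2N times fewer inter-cluster messages than ScaLAPACK QR2 at the same communication volume, for a modest flop overhead:

```python
import math

def scalapack_qr2_r(M, N, C):
    """Critical-path costs of ScaLAPACK QR2, computing R only."""
    return {"msgs": 2 * N * math.log2(C),
            "volume": math.log2(C) * N**2 / 2,
            "flops": (2 * M * N**2 - (2 / 3) * N**3) / C}

def tsqr_r_costs(M, N, C):
    """Critical-path costs of TSQR: fewer messages, same volume, plus
    an extra (2/3) log2(C) N^3 flops spent on the reduction tree."""
    base = (2 * M * N**2 - (2 / 3) * N**3) / C
    return {"msgs": math.log2(C),
            "volume": math.log2(C) * N**2 / 2,
            "flops": base + (2 / 3) * math.log2(C) * N**3}

M, N, C = 10**7, 512, 4
sca, tsq = scalapack_qr2_r(M, N, C), tsqr_r_costs(M, N, C)
assert sca["msgs"] / tsq["msgs"] == 2 * N       # 2N times fewer messages
assert sca["volume"] == tsq["volume"]           # identical volume exchanged
# The extra flops are negligible in the tall-and-skinny regime M >> N:
assert (tsq["flops"] - sca["flops"]) / sca["flops"] < 0.01
```

On a high-latency grid, where each inter-cluster message costs milliseconds, trading messages for a sub-percent flop overhead is exactly the right exchange.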



31 Experiments: Experimental environment, Grid'5000. Four clusters (Orsay, Bordeaux, Sophia-Antipolis, Toulouse); 32 nodes per cluster; 2 cores per node; AMD Opteron 246 (2 GHz / 1 MB L2 cache) up to AMD Opteron 2218 (2.6 GHz / 2 MB L2 cache); Linux; ScaLAPACK; GotoBLAS; 256 cores in total (theoretical peak 2048 Gflop/s; dgemm upper bound 940 Gflop/s). (Table: inter-site network latency in ms and throughput in Mb/s among the four sites.)


33 Experiments: ScaLAPACK performance, N = 64. (Plot: Gflop/s vs. number of rows M, for 4 sites / 128 nodes, 2 sites / 64 nodes, and 1 site / 32 nodes.)

34 Experiments: ScaLAPACK performance, N = 128. (Plot: same layout.)

35 Experiments: ScaLAPACK performance, N = 256. (Plot: same layout.)

36 Experiments: ScaLAPACK performance, N = 512. (Plot: same layout.)


38 Experiments: TSQR performance, N = 64, one cluster; effect of the number of domains per cluster. (Plot: Gflop/s vs. number of domains, for several matrix heights M.)

39 Experiments: TSQR performance, N = 64, all four clusters; effect of the number of domains per cluster. (Plot: Gflop/s vs. number of domains per cluster, up to 64, for several matrix heights M.)

40 Experiments: TSQR performance, one cluster; effect of the number of domains per cluster. (Plot: Gflop/s vs. number of domains, for several matrix heights M.)

41 Experiments: TSQR performance, all four clusters; effect of the number of domains per cluster. (Plot: Gflop/s vs. number of domains per cluster, up to 64, for several matrix heights M.)

42 Experiments: TSQR performance, N = 64 (optimum configuration). (Plot: Gflop/s vs. number of rows M, for 4 sites / 128 nodes, 2 sites / 64 nodes, and 1 site / 32 nodes.)

43 Experiments: TSQR performance, N = 128 (optimum configuration). (Plot: same layout.)

44 Experiments: TSQR performance, N = 256 (optimum configuration). (Plot: same layout.)

45 Experiments: TSQR performance, N = 512 (optimum configuration). (Plot: same layout.)


47 Experiments: TSQR vs ScaLAPACK performance, N = 64. (Plot: Gflop/s vs. number of rows M, for the best TSQR and the best ScaLAPACK configurations.)

48 Experiments: TSQR vs ScaLAPACK performance, N = 128. (Plot: same layout.)

49 Experiments: TSQR vs ScaLAPACK performance, N = 256. (Plot: same layout.)

50 Experiments: TSQR vs ScaLAPACK performance, N = 512. (Plot: same layout.)


52 Conclusion. Can we speed up dense linear algebra applications using a computational grid? Yes, at least for applications based on the QR factorization of Tall and Skinny matrices.

53 Future directions. What about square matrices (CAQR)? LU and Cholesky factorizations? Can we benefit from recursive kernels?

54 Conclusion and future work: current investigations. Comparison of QR algorithms, N = 64, 1 cluster. (Plot: Gflop/s vs. matrix height M for Cholesky QR A v0, Householder A v2, PDGEQRF, Householder A v1, CGS v0, CGS v1, and Householder A v0.)

55 Comparison of QR algorithms, N = 64, 2 clusters. (Plot: same algorithms.)

56 Comparison of QR algorithms, N = 64, 4 clusters. (Plot: same algorithms.)

57 Comparison of QR algorithms, N = 512, 1 cluster. (Plot: same algorithms.)

58 Comparison of QR algorithms, N = 512, 2 clusters. (Plot: same algorithms.)

59 Comparison of QR algorithms, N = 512, 4 clusters. (Plot: same algorithms.)

60 Thanks. Questions?

LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment


A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June

More information

Hierarchical QR factorization algorithms for multi-core cluster systems

Hierarchical QR factorization algorithms for multi-core cluster systems Hierarchical QR factorization algorithms for multi-core cluster systems Jack Dongarra Mathieu Faverge Thomas Herault Julien Langou Yves Robert University of Tennessee Knoxville, USA University of Colorado

More information

Minimizing Communication in Linear Algebra. James Demmel 15 June

Minimizing Communication in Linear Algebra. James Demmel 15 June Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on

More information

Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems

Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems Emmanuel AGULLO (INRIA HiePACS team) Julien LANGOU (University of Colorado Denver) Joint work with University

More information

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): ) ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS Abid Rafique, Nachiket Kapre and George A. Constantinides Electrical and Electronic Engineering Department Imperial College London London,

More information

Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179

Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179 Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179 Marc Baboulin Luc Giraud Serge Gratton Julien Langou September 19, 2006 Abstract

More information

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially

More information

2.5D algorithms for distributed-memory computing

2.5D algorithms for distributed-memory computing ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster

More information

A distributed packed storage for large dense parallel in-core calculations

A distributed packed storage for large dense parallel in-core calculations CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A

More information

Introduction to communication avoiding linear algebra algorithms in high performance computing

Introduction to communication avoiding linear algebra algorithms in high performance computing Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for

More information

Lightweight Superscalar Task Execution in Distributed Memory

Lightweight Superscalar Task Execution in Distributed Memory Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,

More information

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti

More information

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems Noname manuscript No. (will be inserted by the editor) Power Profiling of Cholesky and QR Factorizations on Distributed s George Bosilca Hatem Ltaief Jack Dongarra Received: date / Accepted: date Abstract

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012 MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized

More information

Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling

Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling 1 Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling Jack Dongarra and Piotr Luszczek Oak Ridge National Laboratory Rice University University

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009

Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009 III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009 Background and motivation Running time of an algorithm

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners José I. Aliaga Leveraging task-parallelism in energy-efficient ILU preconditioners Universidad Jaime I (Castellón, Spain) José I. Aliaga

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

Porting a Sphere Optimization Program from lapack to scalapack

Porting a Sphere Optimization Program from lapack to scalapack Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential

More information

Provably efficient algorithms for numerical tensor algebra

Provably efficient algorithms for numerical tensor algebra Provably efficient algorithms for numerical tensor algebra Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley Dissertation Defense Adviser: James Demmel Aug 27, 2014 Edgar Solomonik

More information

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse

More information

Parallelization of the Molecular Orbital Program MOS-F

Parallelization of the Molecular Orbital Program MOS-F Parallelization of the Molecular Orbital Program MOS-F Akira Asato, Satoshi Onodera, Yoshie Inada, Elena Akhmatskaya, Ross Nobes, Azuma Matsuura, Atsuya Takahashi November 2003 Fujitsu Laboratories of

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

Porting a sphere optimization program from LAPACK to ScaLAPACK

Porting a sphere optimization program from LAPACK to ScaLAPACK Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems Marc Baboulin, Jack Dongarra, Rémi Lacroix To cite this version: Marc Baboulin, Jack Dongarra, Rémi Lacroix. Computing least squares

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Eigenvalue Problems Section 5.1 Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael

More information

MUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux

MUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent

More information

Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods

Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Marc Baboulin 1, Xiaoye S. Li 2 and François-Henry Rouet 2 1 University of Paris-Sud, Inria Saclay, France 2 Lawrence Berkeley

More information

Modeling and Tuning Parallel Performance in Dense Linear Algebra

Modeling and Tuning Parallel Performance in Dense Linear Algebra Modeling and Tuning Parallel Performance in Dense Linear Algebra Initial Experiences with the Tile QR Factorization on a Multi Core System CScADS Workshop on Automatic Tuning for Petascale Systems Snowbird,

More information

COALA: Communication Optimal Algorithms

COALA: Communication Optimal Algorithms COALA: Communication Optimal Algorithms for Linear Algebra Jim Demmel EECS & Math Depts. UC Berkeley Laura Grigori INRIA Saclay Ile de France Collaborators and Supporters Collaborators at Berkeley (campus

More information

A communication-avoiding thick-restart Lanczos method on a distributed-memory system

A communication-avoiding thick-restart Lanczos method on a distributed-memory system A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version 1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2

More information

Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method

Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method S. Contassot-Vivier, R. Couturier, C. Denis and F. Jézéquel Université

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest

More information

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February

More information

Avoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University

Avoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University Avoiding communication in linear algebra Laura Grigori INRIA Saclay Paris Sud University Motivation Algorithms spend their time in doing useful computations (flops) or in moving data between different

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS

A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS ROBERT GRANAT, BO KÅGSTRÖM, AND DANIEL KRESSNER Abstract A novel variant of the parallel QR algorithm for solving dense nonsymmetric

More information

On aggressive early deflation in parallel variants of the QR algorithm

On aggressive early deflation in parallel variants of the QR algorithm On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden

More information

Saving Energy in Sparse and Dense Linear Algebra Computations

Saving Energy in Sparse and Dense Linear Algebra Computations Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain

More information

Performance optimization of WEST and Qbox on Intel Knights Landing

Performance optimization of WEST and Qbox on Intel Knights Landing Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 37, No. 3, pp. C307 C330 c 2015 Society for Industrial and Applied Mathematics MIXED-PRECISION CHOLESKY QR FACTORIZATION AND ITS CASE STUDIES ON MULTICORE CPU WITH MULTIPLE GPUS

More information

Parallel Eigensolver Performance on High Performance Computers 1

Parallel Eigensolver Performance on High Performance Computers 1 Parallel Eigensolver Performance on High Performance Computers 1 Andrew Sunderland STFC Daresbury Laboratory, Warrington, UK Abstract Eigenvalue and eigenvector computations arise in a wide range of scientific

More information

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Power-Aware Execution of Sparse and Dense Linear Algebra Libraries

Power-Aware Execution of Sparse and Dense Linear Algebra Libraries Power-Aware Execution of Sparse and Dense Linear Algebra Libraries Enrique S. Quintana-Ortí quintana@icc.uji.es Power-aware execution of linear algebra libraries 1 CECAM Lausanne, Sept. 2011 Motivation

More information

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory MAX PLANCK INSTITUTE November 5, 2010 MPI at MPI Jens Saak Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory FOR DYNAMICS OF COMPLEX TECHNICAL

More information

arxiv: v1 [cs.ms] 18 Nov 2016

arxiv: v1 [cs.ms] 18 Nov 2016 Bidiagonalization with Parallel Tiled Algorithms Mathieu Faverge,2, Julien Langou 3, Yves Robert 2,4, and Jack Dongarra 2,5 Bordeaux INP, CNRS, INRIA et Université de Bordeaux, France 2 University of Tennessee,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 27, 2015 Outline Linear regression Ridge regression and Lasso Time complexity (closed form solution) Iterative Solvers Regression Input: training

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors

High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors Procedia Computer Science Procedia Computer Science 00 (2012) 1 10 International Conference on Computational Science, ICCS 2012 High Performance Dense Linear System Solver with Resilience to Multiple Soft

More information

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization

More information

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

ab initio Electronic Structure Calculations

ab initio Electronic Structure Calculations ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Eliminations and echelon forms in exact linear algebra

Eliminations and echelon forms in exact linear algebra Eliminations and echelon forms in exact linear algebra Clément PERNET, INRIA-MOAIS, Grenoble Université, France East Coast Computer Algebra Day, University of Waterloo, ON, Canada, April 9, 2011 Clément

More information

Review: From problem to parallel algorithm

Review: From problem to parallel algorithm Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:

More information

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors 20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,

More information

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint: Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University

More information

3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC

3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC 3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC 9th Scheduling for Large Scale Systems Workshop, Lyon S. Moustafa, M. Faverge, L. Plagne, and P. Ramet S. Moustafa, M.

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

Iterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay

Iterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay Iterative methods, preconditioning, and their application to CMB data analysis Laura Grigori INRIA Saclay Plan Motivation Communication avoiding for numerical linear algebra Novel algorithms that minimize

More information

Reducing the Amount of Pivoting in Symmetric Indefinite Systems

Reducing the Amount of Pivoting in Symmetric Indefinite Systems Reducing the Amount of Pivoting in Symmetric Indefinite Systems Dulceneia Becker 1, Marc Baboulin 4, and Jack Dongarra 1,2,3 1 University of Tennessee, USA [dbecker7,dongarra]@eecs.utk.edu 2 Oak Ridge

More information

MPI Implementations for Solving Dot - Product on Heterogeneous Platforms

MPI Implementations for Solving Dot - Product on Heterogeneous Platforms MPI Implementations for Solving Dot - Product on Heterogeneous Platforms Panagiotis D. Michailidis and Konstantinos G. Margaritis Abstract This paper is focused on designing two parallel dot product implementations

More information

The Design and Implementation of the Parallel. Out-of-core ScaLAPACK LU, QR and Cholesky. Factorization Routines

The Design and Implementation of the Parallel. Out-of-core ScaLAPACK LU, QR and Cholesky. Factorization Routines The Design and Implementation of the Parallel Out-of-core ScaLAPACK LU, QR and Cholesky Factorization Routines Eduardo D Azevedo Jack Dongarra Abstract This paper describes the design and implementation

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS

More information