QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment


QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

Emmanuel Agullo (INRIA / LaBRI), Camille Coti (Iowa State University), Jack Dongarra (University of Tennessee), Thomas Hérault (U. Paris Sud / U. of Tennessee / LRI / INRIA), Julien Langou (University of Colorado Denver)

IPDPS, Atlanta, USA, April 19-23, 2010

Introduction

Question: can we speed up dense linear algebra applications using a computational grid?

Building blocks

- Tremendous computational power of grid infrastructures: BOINC: 2.4 Pflop/s; Folding@home: 7.9 Pflop/s.
- MPI-based linear algebra libraries: ScaLAPACK; HP Linpack.
- Grid-enabled MPI middleware: MPICH-G2; PACX-MPI; GridMPI.

Past answers

Can we speed up dense linear algebra applications using a computational grid?

- GrADS project [Petitet et al., 2001]: the grid makes it possible to process larger matrices, but for matrices that fit in the (distributed) memory of a cluster, using a single cluster is optimal.
- Study on a cloud infrastructure [Napper et al., 2009], running Linpack on the commercial Amazon EC2 offering: under-calibrated components; the grid costs too much.

Our approach

Principle: confine the intensive communications (ScaLAPACK calls) within the individual geographical sites.
Method: articulate communication-avoiding algorithms [Demmel et al., 2008] with a topology-aware middleware (QCG-OMPI).
Focus: QR factorization of tall and skinny matrices.

Outline

1. Background
2. Articulation of TSQR with QCG-OMPI
3. Experiments: ScaLAPACK performance; TSQR performance; TSQR vs ScaLAPACK performance
4. Conclusion and future work

1. Background

Background: TSQR / CAQR

Communication-Avoiding QR (CAQR) [Demmel et al., 2008]; Tall and Skinny QR (TSQR) is its panel kernel: TSQR produces the R factor of a panel, followed by updates of the trailing matrix.

Examples of applications for TSQR:
- panel factorization in CAQR;
- block iterative methods (iterative methods with multiple right-hand sides, or iterative eigenvalue solvers);
- linear least squares problems in which the number of equations vastly exceeds the number of unknowns.

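To make the reduction concrete, here is a minimal sequential numpy sketch of the TSQR reduction tree (the function name and the 8-domain split are illustrative, not from the talk): each domain factorizes its block locally, then pairs of R factors are stacked and re-factorized until a single R remains.

```python
import numpy as np

def tsqr_r(blocks):
    """R factor of the stacked blocks via a binary reduction tree:
    factorize each block locally, then repeatedly stack pairs of
    R factors and re-factorize until one R remains."""
    rs = [np.linalg.qr(b, mode="r") for b in blocks]
    while len(rs) > 1:
        pairs = [np.vstack(rs[i:i + 2]) for i in range(0, len(rs), 2)]
        rs = [np.linalg.qr(p, mode="r") for p in pairs]
    return rs[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((1 << 14, 64))   # tall and skinny: M >> N
R = tsqr_r(np.array_split(A, 8))         # 8 local domains
# Agrees with a direct QR up to the signs of the rows of R.
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode="r")))
```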

Background: QCG-OMPI, topology-aware MPI middleware for the grid

MPICH-G2:
- describes the topology through the concept of colors, used to build topology-aware MPI communicators (a sketch of the mechanism follows this slide);
- the application has to adapt itself to the discovered topology;
- based on MPICH.

QCG-OMPI:
- resource-aware grid meta-scheduler (QosCosGrid);
- allocates resources that match requirements expressed in a JobProfile (amount of memory, CPU speed, network properties between groups of processes, ...);
- the application is therefore always executed on an appropriate resource topology;
- based on Open MPI.

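As an illustration of the color mechanism, here is a minimal mpi4py sketch (not QCG-OMPI's actual API); the SITE_ID environment variable stands in for the topology information the middleware would discover:

```python
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Stand-in for middleware-provided topology: which site this rank is on.
site = int(os.environ.get("SITE_ID", "0"))

# One communicator per geographical site (fast intra-site links).
intra = comm.Split(site, comm.rank)

# A communicator connecting the site leaders (slow inter-site links).
leader_color = 0 if intra.rank == 0 else MPI.UNDEFINED
inter = comm.Split(leader_color, comm.rank)
if inter != MPI.COMM_NULL:
    pass  # only the site leaders communicate across sites here
```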

2. Articulation of TSQR with QCG-OMPI

Communication pattern (M-by-3 matrix): ScaLAPACK panel factorization routine (PDGEQRF), non-optimized tree

[Figure: three clusters holding 5, 4 and 2 domains respectively; reduction without reduce affinity.]

25 inter-cluster communications.


Communication pattern (M-by-3 matrix): ScaLAPACK panel factorization routine (PDGEQRF), optimized tree

[Figure: same three clusters; reduction with reduce affinity.]

10 inter-cluster communications.

Communication pattern (M-by-3 matrix): TSQR, optimized tree

2 inter-cluster communications.


Articulation of TSQR with QCG-OMPI

ScaLAPACK-based TSQR (see the reduction sketch after this slide):
- each domain is factorized with a ScaLAPACK call;
- the reduction is performed between pairs of communicators;
- the number of domains per cluster may vary from 1 to 64 (the number of cores per cluster).

QCG JobProfile:
- processes are split into groups of equivalent computing power;
- good network connectivity inside each group (low latency, high bandwidth);
- lower network connectivity between the groups.

This is the classical cluster-of-clusters approach (with a constraint on the relative sizes of the clusters to facilitate load balancing).
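A hedged mpi4py sketch of the inter-domain reduction, with one rank standing in for one domain; a production run would factorize each domain with a ScaLAPACK call over the domain's process grid rather than numpy:

```python
import numpy as np
from mpi4py import MPI

def grid_tsqr_r(A_local, comm):
    """Binary-tree TSQR across `comm`, one rank per domain (sketch)."""
    R = np.linalg.qr(A_local, mode="r")       # local factorization
    step = 1
    while step < comm.size:                   # one level per tree stage
        if comm.rank % (2 * step) == 0:       # survivor: receive and merge
            partner = comm.rank + step
            if partner < comm.size:
                R_other = comm.recv(source=partner, tag=step)
                R = np.linalg.qr(np.vstack([R, R_other]), mode="r")
        elif comm.rank % (2 * step) == step:  # sender: ship R and drop out
            comm.send(R, dest=comm.rank - step, tag=step)
        step *= 2
    return R if comm.rank == 0 else None
```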

Communication and computation breakdown (critical path)

Computing R:
                  # inter-cluster msgs    inter-cluster volume    # FLOPs
ScaLAPACK QR2     2N log2(C)              log2(C) N^2/2           (2MN^2 - (2/3)N^3)/C
TSQR              log2(C)                 log2(C) N^2/2           (2MN^2 - (2/3)N^3)/C + (2/3) log2(C) N^3

Computing Q and R (on C clusters):
                  # inter-cluster msgs    inter-cluster volume    # FLOPs
ScaLAPACK QR2     4N log2(C)              2 log2(C) N^2/2         (4MN^2 - (4/3)N^3)/C
TSQR              2 log2(C)               2 log2(C) N^2/2         (4MN^2 - (4/3)N^3)/C + (4/3) log2(C) N^3

C is the number of clusters; one domain per cluster is assumed in these tables.

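As a worked instance of the first table (the parameter values are illustrative): with C = 4 clusters and N = 512 columns, computing R costs ScaLAPACK QR2 2N log2(C) = 2048 inter-cluster messages on the critical path, against log2(C) = 2 for TSQR, for the same communicated volume.

```python
from math import log2

def critical_path_msgs(N, C):
    """Inter-cluster message counts for computing R (first table above)."""
    return {"ScaLAPACK QR2": 2 * N * log2(C), "TSQR": log2(C)}

print(critical_path_msgs(N=512, C=4))  # {'ScaLAPACK QR2': 2048.0, 'TSQR': 2.0}
```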

3. Experiments

Experimental environment: Grid'5000

- Four clusters (Orsay, Toulouse, Bordeaux, Sophia-Antipolis), 32 nodes per cluster, 2 cores per node.
- CPUs range from AMD Opteron 246 (2 GHz, 1 MB L2 cache) up to AMD Opteron 2218 (2.6 GHz, 2 MB L2 cache).
- Linux 2.6.30, ScaLAPACK 1.8.0, GotoBLAS 1.26.
- 256 cores in total (theoretical peak 2048 Gflop/s; dgemm upper bound 940 Gflop/s).

Network latency (ms):
            Orsay   Toulouse   Bordeaux   Sophia
Orsay       0.07    7.97       6.98       6.12
Toulouse            0.03       9.03       8.18
Bordeaux                       0.05       7.18
Sophia                                    0.06

Throughput (Mb/s):
            Orsay   Toulouse   Bordeaux   Sophia
Orsay       890     78         90         102
Toulouse            890        77         90
Bordeaux                       890        83
Sophia                                    890
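Plugging the measured network into a rough latency/bandwidth cost model shows why the message count dominates on this platform; the sketch below is illustrative only (64-bit reals, message counts and volume taken from the breakdown tables above, round inter-site figures of 8 ms and 90 Mb/s):

```python
from math import log2

LAT_S = 8e-3     # ~8 ms inter-site latency (latency table above)
BW_BITS = 90e6   # ~90 Mb/s inter-site throughput (throughput table above)

def critical_path_time(msgs, vol_doubles):
    """Alpha-beta model: each message pays latency, volume pays bandwidth."""
    return msgs * LAT_S + vol_doubles * 64 / BW_BITS

N, C = 512, 4
vol = log2(C) * N**2 / 2                        # doubles moved (both algorithms)
print(critical_path_time(2 * N * log2(C), vol))  # ScaLAPACK QR2: ~16.6 s
print(critical_path_time(log2(C), vol))          # TSQR: ~0.2 s
```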

Experiments: ScaLAPACK performance

ScaLAPACK, N = 64: Gflop/s versus number of rows M (1e5 to 1e8), for 4 sites (128 nodes), 2 sites (64 nodes), 1 site (32 nodes). [Figure; vertical axis up to 35 Gflop/s.]

ScaLAPACK, N = 128: same axes and configurations. [Figure; vertical axis up to 60 Gflop/s.]

ScaLAPACK, N = 256: Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 60 Gflop/s.]

ScaLAPACK, N = 512: Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 90 Gflop/s.]

Experiments: TSQR performance

TSQR, N = 64, one cluster, effect of the number of domains (1 to 64), for M = 65 536, 131 072, 1 048 576, 8 388 608. [Figure; vertical axis up to 40 Gflop/s.]

TSQR, N = 64, all four clusters, effect of the number of domains per cluster (1 to 64), for M = 131 072, 524 288, 4 194 304, 33 554 432. [Figure; vertical axis up to 100 Gflop/s.]

TSQR, N = 512, one cluster, effect of the number of domains (1 to 64), for M = 65 536, 131 072, 1 048 576, 2 097 152. [Figure; vertical axis up to 100 Gflop/s.]

TSQR, N = 512, all four clusters, effect of the number of domains per cluster (1 to 64), for M = 262 144, 524 288, 2 097 152, 8 388 608. [Figure; vertical axis up to 350 Gflop/s.]

TSQR in its optimum configuration, N = 64: Gflop/s versus M (1e5 to 1e8), for 4 sites (128 nodes), 2 sites (64 nodes), 1 site (32 nodes). [Figure; vertical axis up to 100 Gflop/s.]

TSQR in its optimum configuration, N = 128: same axes and configurations. [Figure; vertical axis up to 140 Gflop/s.]

TSQR in its optimum configuration, N = 256: Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 180 Gflop/s.]

TSQR in its optimum configuration, N = 512: Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 300 Gflop/s.]

Experiments: TSQR vs ScaLAPACK performance

TSQR vs ScaLAPACK, N = 64: best configuration of each, Gflop/s versus M (1e5 to 1e8). [Figure; vertical axis up to 90 Gflop/s.]

TSQR vs ScaLAPACK, N = 128: best configuration of each, Gflop/s versus M (1e5 to 1e8). [Figure; vertical axis up to 120 Gflop/s.]

TSQR vs ScaLAPACK, N = 256: best configuration of each, Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 180 Gflop/s.]

TSQR vs ScaLAPACK, N = 512: best configuration of each, Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 300 Gflop/s.]

4. Conclusion and future work

Conclusion

Can we speed up dense linear algebra applications using a computational grid? Yes, at least for applications based on the QR factorization of tall and skinny matrices.

Future directions

- What about square matrices (CAQR)?
- LU and Cholesky factorizations?
- Can we benefit from recursive kernels?

Current investigations: comparison of QR algorithms (Cholesky QR, Householder A v0/v1/v2, PDGEQRF, CGS v0/v1), GFlops versus matrix height M.

- N = 64, 1 cluster: M up to 3.5e7. [Figure; vertical axis up to 80 GFlops.]
- N = 64, 2 clusters: M up to 7e7. [Figure; vertical axis up to 140 GFlops.]
- N = 64, 4 clusters: M up to 1.4e8. [Figure; vertical axis up to 300 GFlops.]
- N = 512, 1 cluster: M up to 4.5e6. [Figure; vertical axis up to 140 GFlops.]
- N = 512, 2 clusters: M up to 9e6. [Figure; vertical axis up to 250 GFlops.]
- N = 512, 4 clusters: M up to 1.8e7. [Figure; vertical axis up to 500 GFlops.]

Thanks

Questions?