QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment


1 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment. Emmanuel Agullo (INRIA / LaBRI), Camille Coti (Iowa State University), Jack Dongarra (University of Tennessee), Thomas Hérault (U. Paris Sud / U. of Tennessee / LRI / INRIA), Julien Langou (University of Colorado Denver). IPDPS, Atlanta, USA, April 19-23, 2010.

2 Introduction: Question. Can we speed up dense linear algebra applications using a computational grid?

3 Introduction: Building blocks. Tremendous computational power of grid infrastructures: BOINC (2.4 Pflop/s, 7.9 Pflop/s). MPI-based linear algebra libraries: ScaLAPACK; HP Linpack. Grid-enabled MPI middleware: MPICH-G2; PACX-MPI; GridMPI.

4 Introduction: Past answers. Can we speed up dense linear algebra applications using a computational grid? GrADS project [Petitet et al., 2001]: the grid makes it possible to process larger matrices; but for matrices that fit in the (distributed) memory of a cluster, using a single cluster is optimal. Study on a cloud infrastructure [Napper et al., 2009], running Linpack on the Amazon EC2 commercial offer: under-calibrated components; the grid costs too much.

5 Introduction: Our approach. Principle: confine intensive communications (ScaLAPACK calls) within the different geographical sites. Method: articulate Communication-Avoiding algorithms [Demmel et al., 2008] with a topology-aware middleware (QCG-OMPI). Focus: QR factorization of Tall and Skinny matrices.

6 Outline. 1. Background. 2. Articulation of TSQR with QCG-OMPI. 3. Experiments: ScaLAPACK performance; TSQR performance; TSQR vs ScaLAPACK performance. 4. Conclusion and future work.


8 Background: TSQR / CAQR. Communication-Avoiding QR (CAQR) [Demmel et al., 2008] and Tall and Skinny QR (TSQR). (Figure: CAQR computes R through TSQR panel factorizations followed by updates.) Examples of applications for TSQR: panel factorization in CAQR; block iterative methods (iterative methods with multiple right-hand sides, or iterative eigenvalue solvers); linear least squares problems with far more equations than unknowns.
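The TSQR reduction tree described above can be sketched in a few lines of NumPy. This is an illustrative single-process simulation (split into row blocks, local QR per block, pairwise combination of stacked R factors), not the distributed MPI implementation used in the talk:

```python
import numpy as np

def tsqr_r(A, num_domains):
    """Compute the R factor of a tall-and-skinny A with a binary
    reduction tree: local QR per domain, then pairwise combination
    of stacked R factors until a single R remains."""
    blocks = np.array_split(A, num_domains, axis=0)
    # Leaf level: independent local factorizations (one per domain).
    rs = [np.linalg.qr(b, mode='r') for b in blocks]
    # Tree reduction: each round halves the number of R factors.
    while len(rs) > 1:
        nxt = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode='r')
               for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2 == 1:          # an odd leftover passes through
            nxt.append(rs[-1])
        rs = nxt
    return rs[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 64))   # tall and skinny: M >> N
R_tree = tsqr_r(A, num_domains=8)
R_flat = np.linalg.qr(A, mode='r')
# R is unique up to the signs of its rows, so compare magnitudes.
assert np.allclose(np.abs(R_tree), np.abs(R_flat))
```

Each tree level only exchanges small N-by-N triangles, which is why the inter-cluster traffic stays low when the leaves are mapped to clusters.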


12 Background: QCG-OMPI, topology-aware MPI middleware for the grid. MPICH-G2: describes the topology through the concept of colors, used to build topology-aware MPI communicators; the application has to adapt itself to the discovered topology; based on MPICH. QCG-OMPI: resource-aware grid meta-scheduler (QosCosGrid); allocates resources that match requirements expressed in a JobProfile (amount of memory, CPU speed, network properties between groups of processes, ...); the application is always executed on an appropriate resource topology; based on Open MPI.
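The "colors" mechanism can be illustrated with a toy, MPI-free sketch. The helper below is hypothetical (it is not the MPICH-G2 or QCG-OMPI API); it only mirrors what MPI_Comm_split does: processes sharing a color land in the same intra-cluster group, and one leader per group can then form an inter-cluster group:

```python
from collections import defaultdict

def split_by_color(rank_to_color):
    """Group ranks by color, as MPI_Comm_split would group processes
    into one communicator per distinct color value."""
    groups = defaultdict(list)
    for rank in sorted(rank_to_color):
        groups[rank_to_color[rank]].append(rank)
    return dict(groups)

# Eight processes spread over three clusters (color = cluster id).
colors = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2}
intra = split_by_color(colors)
leaders = [members[0] for members in intra.values()]

print(intra)    # {0: [0, 1, 2], 1: [3, 4], 2: [5, 6, 7]}
print(leaders)  # [0, 3, 5] -- one process per cluster for inter-cluster steps
```

Intra-cluster communicators carry the bandwidth-heavy ScaLAPACK traffic, while only the leaders talk across clusters.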



16 Articulation of TSQR with QCG-OMPI: Communication pattern (M-by-3 matrix). ScaLAPACK (panel factorization routine), non-optimized tree. (Figure: illustration of ScaLAPACK PDGEQRF without reduce affinity; Cluster 1 holds Domains 1,1 to 1,5; Cluster 2 holds Domains 2,1 to 2,4; Cluster 3 holds Domains 3,1 and 3,2.) 25 inter-cluster communications.


18 Articulation of TSQR with QCG-OMPI: Communication pattern (M-by-3 matrix). ScaLAPACK (panel factorization routine), optimized tree. (Figure: illustration of ScaLAPACK PDGEQRF with reduce affinity, over the same three clusters.) 10 inter-cluster communications.

19 Articulation of TSQR with QCG-OMPI: Communication pattern (M-by-3 matrix). TSQR, optimized tree. 2 inter-cluster communications.


21 Articulation of TSQR with QCG-OMPI: Communication pattern (M-by-3 matrix). (Figure: side-by-side comparison of the ScaLAPACK and TSQR communication patterns.)

22 Articulation of TSQR with QCG-OMPI. ScaLAPACK-based TSQR: each domain is factorized with a ScaLAPACK call; the reduction is performed by pairs of communicators; the number of domains per cluster may vary from 1 to 64 (the number of cores per cluster). QCG JobProfile: processes are split into groups of equivalent computing power; good network connectivity inside each group (low latency, high bandwidth); lower network connectivity between the groups. This is a classical cluster-of-clusters approach (with a constraint on the relative size of the clusters to facilitate load balancing).
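The constraints listed above (equal-power groups, fast intra-group links, slower inter-group links) can be written down concretely. The dictionary below is a hypothetical, simplified stand-in for a QCG JobProfile; the real schema differs, and all numeric thresholds here are invented for illustration:

```python
# Hypothetical stand-in for a QCG JobProfile (not the actual schema);
# the latency/bandwidth thresholds are illustrative assumptions only.
job_profile = {
    "process_groups": [
        {"name": f"cluster{i}", "processes": 32, "min_cpu_ghz": 2.0}
        for i in range(4)
    ],
    "intra_group_network": {"max_latency_ms": 0.1, "min_bandwidth_mbps": 1000},
    "inter_group_network": {"max_latency_ms": 20.0, "min_bandwidth_mbps": 100},
}

def balanced(profile):
    """Check the load-balancing constraint: all groups the same size."""
    sizes = {g["processes"] for g in profile["process_groups"]}
    return len(sizes) == 1

assert balanced(job_profile)
```

Expressing the requirements this way lets the meta-scheduler pick any set of clusters that satisfies them, so the application never has to probe the topology itself.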

23 Articulation of TSQR with QCG-OMPI: Communication and computation breakdown (critical path).

Computing R:
                # inter-cluster msg | inter-cluster vol. exchanged | # FLOPs
ScaLAPACK QR2   2N log2(C)          | log2(C) (N^2/2)              | (2MN^2 - (2/3)N^3)/C
TSQR            log2(C)             | log2(C) (N^2/2)              | (2MN^2 - (2/3)N^3)/C + (2/3) log2(C) N^3

Computing Q and R (on C clusters):
                # inter-cluster msg | inter-cluster vol. exchanged | # FLOPs
ScaLAPACK QR2   4N log2(C)          | 2 log2(C) (N^2/2)            | (4MN^2 - (4/3)N^3)/C
TSQR            2 log2(C)           | 2 log2(C) (N^2/2)            | (4MN^2 - (4/3)N^3)/C + (4/3) log2(C) N^3

C: number of clusters; one domain per cluster is assumed for these tables.
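These critical-path counts can be checked numerically. The sketch below transcribes the R-factor formulas (one domain per cluster, C a power of two) and confirms that TSQR sends 2N times fewer inter-cluster messages than ScaLAPACK QR2 at the same communication volume, for a modest flop overhead:

```python
import math

def scalapack_qr2_r(M, N, C):
    """Critical-path costs of ScaLAPACK QR2, computing R only."""
    return {"msgs": 2 * N * math.log2(C),
            "volume": math.log2(C) * N**2 / 2,
            "flops": (2 * M * N**2 - (2 / 3) * N**3) / C}

def tsqr_r_costs(M, N, C):
    """Critical-path costs of TSQR: fewer messages, same volume, plus
    an extra (2/3) log2(C) N^3 flops spent on the reduction tree."""
    base = (2 * M * N**2 - (2 / 3) * N**3) / C
    return {"msgs": math.log2(C),
            "volume": math.log2(C) * N**2 / 2,
            "flops": base + (2 / 3) * math.log2(C) * N**3}

M, N, C = 10**7, 512, 4
sca, tsq = scalapack_qr2_r(M, N, C), tsqr_r_costs(M, N, C)
assert sca["msgs"] / tsq["msgs"] == 2 * N       # 2N times fewer messages
assert sca["volume"] == tsq["volume"]           # identical volume exchanged
# The extra flops are negligible in the tall-and-skinny regime M >> N:
assert (tsq["flops"] - sca["flops"]) / sca["flops"] < 0.01
```

On a high-latency grid, where each inter-cluster message costs milliseconds, trading messages for a sub-percent flop overhead is exactly the right exchange.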



31 Experiments: Experimental environment, Grid'5000. Four clusters (Orsay, Bordeaux, Sophia-Antipolis, Toulouse); 32 nodes per cluster; 2 cores per node; AMD Opteron 246 (2 GHz / 1 MB L2 cache) up to AMD Opteron 2218 (2.6 GHz / 2 MB L2 cache); Linux; ScaLAPACK; GotoBLAS; 256 cores in total (theoretical peak 2048 Gflop/s; dgemm upper bound 940 Gflop/s). (Table: inter-site network latency in ms and throughput in Mb/s among the four sites.)


33 Experiments: ScaLAPACK performance, N = 64. (Plot: Gflop/s vs. number of rows M, for 4 sites / 128 nodes, 2 sites / 64 nodes, and 1 site / 32 nodes.)

34 Experiments: ScaLAPACK performance, N = 128. (Plot: same layout.)

35 Experiments: ScaLAPACK performance, N = 256. (Plot: same layout.)

36 Experiments: ScaLAPACK performance, N = 512. (Plot: same layout.)


38 Experiments: TSQR performance, N = 64, one cluster; effect of the number of domains per cluster. (Plot: Gflop/s vs. number of domains, for several matrix heights M.)

39 Experiments: TSQR performance, N = 64, all four clusters; effect of the number of domains per cluster. (Plot: Gflop/s vs. number of domains per cluster, up to 64, for several matrix heights M.)

40 Experiments: TSQR performance, one cluster; effect of the number of domains per cluster. (Plot: Gflop/s vs. number of domains, for several matrix heights M.)

41 Experiments: TSQR performance, all four clusters; effect of the number of domains per cluster. (Plot: Gflop/s vs. number of domains per cluster, up to 64, for several matrix heights M.)

42 Experiments: TSQR performance, N = 64 (optimum configuration). (Plot: Gflop/s vs. number of rows M, for 4 sites / 128 nodes, 2 sites / 64 nodes, and 1 site / 32 nodes.)

43 Experiments: TSQR performance, N = 128 (optimum configuration). (Plot: same layout.)

44 Experiments: TSQR performance, N = 256 (optimum configuration). (Plot: same layout.)

45 Experiments: TSQR performance, N = 512 (optimum configuration). (Plot: same layout.)


47 Experiments: TSQR vs ScaLAPACK performance, N = 64. (Plot: Gflop/s vs. number of rows M, for the best TSQR and the best ScaLAPACK configurations.)

48 Experiments: TSQR vs ScaLAPACK performance, N = 128. (Plot: same layout.)

49 Experiments: TSQR vs ScaLAPACK performance, N = 256. (Plot: same layout.)

50 Experiments: TSQR vs ScaLAPACK performance, N = 512. (Plot: same layout.)


52 Conclusion. Can we speed up dense linear algebra applications using a computational grid? Yes, at least for applications based on the QR factorization of Tall and Skinny matrices.

53 Future directions. What about square matrices (CAQR)? LU and Cholesky factorizations? Can we benefit from recursive kernels?

54 Conclusion and future work: current investigations. Comparison of QR algorithms, N = 64, 1 cluster. (Plot: Gflop/s vs. matrix height M for Cholesky QR A v0, Householder A v2, PDGEQRF, Householder A v1, CGS v0, CGS v1, and Householder A v0.)

55 Comparison of QR algorithms, N = 64, 2 clusters. (Plot: same algorithms.)

56 Comparison of QR algorithms, N = 64, 4 clusters. (Plot: same algorithms.)

57 Comparison of QR algorithms, N = 512, 1 cluster. (Plot: same algorithms.)

58 Comparison of QR algorithms, N = 512, 2 clusters. (Plot: same algorithms.)

59 Comparison of QR algorithms, N = 512, 4 clusters. (Plot: same algorithms.)

60 Thanks. Questions?

LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment


A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June

More information

Hierarchical QR factorization algorithms for multi-core cluster systems

Hierarchical QR factorization algorithms for multi-core cluster systems Hierarchical QR factorization algorithms for multi-core cluster systems Jack Dongarra Mathieu Faverge Thomas Herault Julien Langou Yves Robert University of Tennessee Knoxville, USA University of Colorado

More information

Minimizing Communication in Linear Algebra. James Demmel 15 June

Minimizing Communication in Linear Algebra. James Demmel 15 June Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on

More information

Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems

Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems Emmanuel AGULLO (INRIA HiePACS team) Julien LANGOU (University of Colorado Denver) Joint work with University

More information

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): ) ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS Abid Rafique, Nachiket Kapre and George A. Constantinides Electrical and Electronic Engineering Department Imperial College London London,

More information

Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179

Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179 Parallel tools for solving incremental dense least squares problems. Application to space geodesy. LAPACK Working Note 179 Marc Baboulin Luc Giraud Serge Gratton Julien Langou September 19, 2006 Abstract

More information

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially

More information

2.5D algorithms for distributed-memory computing

2.5D algorithms for distributed-memory computing ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster

More information

A distributed packed storage for large dense parallel in-core calculations

A distributed packed storage for large dense parallel in-core calculations CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A

More information

Introduction to communication avoiding linear algebra algorithms in high performance computing

Introduction to communication avoiding linear algebra algorithms in high performance computing Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for

More information

Lightweight Superscalar Task Execution in Distributed Memory

Lightweight Superscalar Task Execution in Distributed Memory Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,

More information

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti

More information

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems Noname manuscript No. (will be inserted by the editor) Power Profiling of Cholesky and QR Factorizations on Distributed s George Bosilca Hatem Ltaief Jack Dongarra Received: date / Accepted: date Abstract

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012 MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized

More information

Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling

Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling 1 Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling Jack Dongarra and Piotr Luszczek Oak Ridge National Laboratory Rice University University

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009

Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009 III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009 Background and motivation Running time of an algorithm

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners José I. Aliaga Leveraging task-parallelism in energy-efficient ILU preconditioners Universidad Jaime I (Castellón, Spain) José I. Aliaga

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

Porting a Sphere Optimization Program from lapack to scalapack

Porting a Sphere Optimization Program from lapack to scalapack Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential

More information

Provably efficient algorithms for numerical tensor algebra

Provably efficient algorithms for numerical tensor algebra Provably efficient algorithms for numerical tensor algebra Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley Dissertation Defense Adviser: James Demmel Aug 27, 2014 Edgar Solomonik

More information

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse

More information

Parallelization of the Molecular Orbital Program MOS-F

Parallelization of the Molecular Orbital Program MOS-F Parallelization of the Molecular Orbital Program MOS-F Akira Asato, Satoshi Onodera, Yoshie Inada, Elena Akhmatskaya, Ross Nobes, Azuma Matsuura, Atsuya Takahashi November 2003 Fujitsu Laboratories of

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

Porting a sphere optimization program from LAPACK to ScaLAPACK

Porting a sphere optimization program from LAPACK to ScaLAPACK Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems Marc Baboulin, Jack Dongarra, Rémi Lacroix To cite this version: Marc Baboulin, Jack Dongarra, Rémi Lacroix. Computing least squares

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Eigenvalue Problems Section 5.1 Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael

More information

MUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux

MUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent

More information

Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods

Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Marc Baboulin 1, Xiaoye S. Li 2 and François-Henry Rouet 2 1 University of Paris-Sud, Inria Saclay, France 2 Lawrence Berkeley

More information

Modeling and Tuning Parallel Performance in Dense Linear Algebra

Modeling and Tuning Parallel Performance in Dense Linear Algebra Modeling and Tuning Parallel Performance in Dense Linear Algebra Initial Experiences with the Tile QR Factorization on a Multi Core System CScADS Workshop on Automatic Tuning for Petascale Systems Snowbird,

More information

COALA: Communication Optimal Algorithms

COALA: Communication Optimal Algorithms COALA: Communication Optimal Algorithms for Linear Algebra Jim Demmel EECS & Math Depts. UC Berkeley Laura Grigori INRIA Saclay Ile de France Collaborators and Supporters Collaborators at Berkeley (campus

More information

A communication-avoiding thick-restart Lanczos method on a distributed-memory system

A communication-avoiding thick-restart Lanczos method on a distributed-memory system A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version 1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2

More information

Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method

Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method S. Contassot-Vivier, R. Couturier, C. Denis and F. Jézéquel Université

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest

More information

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February

More information

Avoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University

Avoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University Avoiding communication in linear algebra Laura Grigori INRIA Saclay Paris Sud University Motivation Algorithms spend their time in doing useful computations (flops) or in moving data between different

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS

A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS ROBERT GRANAT, BO KÅGSTRÖM, AND DANIEL KRESSNER Abstract A novel variant of the parallel QR algorithm for solving dense nonsymmetric

More information

On aggressive early deflation in parallel variants of the QR algorithm

On aggressive early deflation in parallel variants of the QR algorithm On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden

More information

Saving Energy in Sparse and Dense Linear Algebra Computations

Saving Energy in Sparse and Dense Linear Algebra Computations Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain

More information

Performance optimization of WEST and Qbox on Intel Knights Landing

Performance optimization of WEST and Qbox on Intel Knights Landing Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 37, No. 3, pp. C307 C330 c 2015 Society for Industrial and Applied Mathematics MIXED-PRECISION CHOLESKY QR FACTORIZATION AND ITS CASE STUDIES ON MULTICORE CPU WITH MULTIPLE GPUS

More information

Parallel Eigensolver Performance on High Performance Computers 1

Parallel Eigensolver Performance on High Performance Computers 1 Parallel Eigensolver Performance on High Performance Computers 1 Andrew Sunderland STFC Daresbury Laboratory, Warrington, UK Abstract Eigenvalue and eigenvector computations arise in a wide range of scientific

More information

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Power-Aware Execution of Sparse and Dense Linear Algebra Libraries

Power-Aware Execution of Sparse and Dense Linear Algebra Libraries Power-Aware Execution of Sparse and Dense Linear Algebra Libraries Enrique S. Quintana-Ortí quintana@icc.uji.es Power-aware execution of linear algebra libraries 1 CECAM Lausanne, Sept. 2011 Motivation

More information

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory MAX PLANCK INSTITUTE November 5, 2010 MPI at MPI Jens Saak Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory FOR DYNAMICS OF COMPLEX TECHNICAL

More information

arxiv: v1 [cs.ms] 18 Nov 2016

arxiv: v1 [cs.ms] 18 Nov 2016 Bidiagonalization with Parallel Tiled Algorithms Mathieu Faverge,2, Julien Langou 3, Yves Robert 2,4, and Jack Dongarra 2,5 Bordeaux INP, CNRS, INRIA et Université de Bordeaux, France 2 University of Tennessee,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 27, 2015 Outline Linear regression Ridge regression and Lasso Time complexity (closed form solution) Iterative Solvers Regression Input: training

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors

High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors Procedia Computer Science Procedia Computer Science 00 (2012) 1 10 International Conference on Computational Science, ICCS 2012 High Performance Dense Linear System Solver with Resilience to Multiple Soft

More information

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization

More information

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

ab initio Electronic Structure Calculations

ab initio Electronic Structure Calculations ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Eliminations and echelon forms in exact linear algebra

Eliminations and echelon forms in exact linear algebra Eliminations and echelon forms in exact linear algebra Clément PERNET, INRIA-MOAIS, Grenoble Université, France East Coast Computer Algebra Day, University of Waterloo, ON, Canada, April 9, 2011 Clément

More information

Review: From problem to parallel algorithm

Review: From problem to parallel algorithm Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:

More information

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors 20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,

More information

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint: Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University

More information

3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC

3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC 3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC 9th Scheduling for Large Scale Systems Workshop, Lyon S. Moustafa, M. Faverge, L. Plagne, and P. Ramet S. Moustafa, M.

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

Iterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay

Iterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay Iterative methods, preconditioning, and their application to CMB data analysis Laura Grigori INRIA Saclay Plan Motivation Communication avoiding for numerical linear algebra Novel algorithms that minimize

More information

Reducing the Amount of Pivoting in Symmetric Indefinite Systems

Reducing the Amount of Pivoting in Symmetric Indefinite Systems Reducing the Amount of Pivoting in Symmetric Indefinite Systems Dulceneia Becker 1, Marc Baboulin 4, and Jack Dongarra 1,2,3 1 University of Tennessee, USA [dbecker7,dongarra]@eecs.utk.edu 2 Oak Ridge

More information

MPI Implementations for Solving Dot - Product on Heterogeneous Platforms

MPI Implementations for Solving Dot - Product on Heterogeneous Platforms MPI Implementations for Solving Dot - Product on Heterogeneous Platforms Panagiotis D. Michailidis and Konstantinos G. Margaritis Abstract This paper is focused on designing two parallel dot product implementations

More information

The Design and Implementation of the Parallel. Out-of-core ScaLAPACK LU, QR and Cholesky. Factorization Routines

The Design and Implementation of the Parallel. Out-of-core ScaLAPACK LU, QR and Cholesky. Factorization Routines The Design and Implementation of the Parallel Out-of-core ScaLAPACK LU, QR and Cholesky Factorization Routines Eduardo D Azevedo Jack Dongarra Abstract This paper describes the design and implementation

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS

More information