QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment


QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

Emmanuel Agullo (INRIA / LaBRI), Camille Coti (Iowa State University), Jack Dongarra (University of Tennessee), Thomas Hérault (U. Paris Sud / U. of Tennessee / LRI / INRIA), Julien Langou (University of Colorado Denver)

IPDPS, Atlanta, USA, April 19-23, 2010

Introduction

Question: can we speed up dense linear algebra applications using a computational grid?

Building blocks

- Tremendous computational power of grid infrastructures: BOINC: 2.4 Pflop/s; Folding@home: 7.9 Pflop/s.
- MPI-based linear algebra libraries: ScaLAPACK; HP Linpack.
- Grid-enabled MPI middleware: MPICH-G2; PACX-MPI; GridMPI.

Past answers

Can we speed up dense linear algebra applications using a computational grid?

- GrADS project [Petitet et al., 2001]: the grid makes it possible to process larger matrices, but for matrices that fit in the (distributed) memory of a cluster, using a single cluster is optimal.
- Study on a cloud infrastructure [Napper et al., 2009], running Linpack on the commercial Amazon EC2 offering: under-calibrated components; the grid costs too much.

Our approach

Principle: confine the intensive communications (ScaLAPACK calls) within the individual geographical sites.
Method: articulate communication-avoiding algorithms [Demmel et al., 2008] with a topology-aware middleware (QCG-OMPI).
Focus: QR factorization of tall and skinny matrices.

Outline

1. Background
2. Articulation of TSQR with QCG-OMPI
3. Experiments: ScaLAPACK performance; TSQR performance; TSQR vs ScaLAPACK performance
4. Conclusion and future work

1. Background

Background: TSQR / CAQR

Communication-Avoiding QR (CAQR) [Demmel et al., 2008]; Tall and Skinny QR (TSQR) is its panel kernel: TSQR produces the R factor of a panel, followed by updates of the trailing matrix.

Examples of applications for TSQR:
- panel factorization in CAQR;
- block iterative methods (iterative methods with multiple right-hand sides, or iterative eigenvalue solvers);
- linear least squares problems in which the number of equations vastly exceeds the number of unknowns.

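To make the reduction concrete, here is a minimal sequential numpy sketch of the TSQR reduction tree (the function name and the 8-domain split are illustrative, not from the talk): each domain factorizes its block locally, then pairs of R factors are stacked and re-factorized until a single R remains.

```python
import numpy as np

def tsqr_r(blocks):
    """R factor of the stacked blocks via a binary reduction tree:
    factorize each block locally, then repeatedly stack pairs of
    R factors and re-factorize until one R remains."""
    rs = [np.linalg.qr(b, mode="r") for b in blocks]
    while len(rs) > 1:
        pairs = [np.vstack(rs[i:i + 2]) for i in range(0, len(rs), 2)]
        rs = [np.linalg.qr(p, mode="r") for p in pairs]
    return rs[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((1 << 14, 64))   # tall and skinny: M >> N
R = tsqr_r(np.array_split(A, 8))         # 8 local domains
# Agrees with a direct QR up to the signs of the rows of R.
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode="r")))
```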

Background: QCG-OMPI, topology-aware MPI middleware for the grid

MPICH-G2:
- describes the topology through the concept of colors, used to build topology-aware MPI communicators (a sketch of the mechanism follows this slide);
- the application has to adapt itself to the discovered topology;
- based on MPICH.

QCG-OMPI:
- resource-aware grid meta-scheduler (QosCosGrid);
- allocates resources that match requirements expressed in a JobProfile (amount of memory, CPU speed, network properties between groups of processes, ...);
- the application is therefore always executed on an appropriate resource topology;
- based on Open MPI.

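As an illustration of the color mechanism, here is a minimal mpi4py sketch (not QCG-OMPI's actual API); the SITE_ID environment variable stands in for the topology information the middleware would discover:

```python
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Stand-in for middleware-provided topology: which site this rank is on.
site = int(os.environ.get("SITE_ID", "0"))

# One communicator per geographical site (fast intra-site links).
intra = comm.Split(site, comm.rank)

# A communicator connecting the site leaders (slow inter-site links).
leader_color = 0 if intra.rank == 0 else MPI.UNDEFINED
inter = comm.Split(leader_color, comm.rank)
if inter != MPI.COMM_NULL:
    pass  # only the site leaders communicate across sites here
```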

2. Articulation of TSQR with QCG-OMPI

Communication pattern (M-by-3 matrix): ScaLAPACK panel factorization routine (PDGEQRF), non-optimized tree

[Figure: three clusters holding 5, 4 and 2 domains respectively; reduction without reduce affinity.]

25 inter-cluster communications.


Communication pattern (M-by-3 matrix): ScaLAPACK panel factorization routine (PDGEQRF), optimized tree

[Figure: same three clusters; reduction with reduce affinity.]

10 inter-cluster communications.

Communication pattern (M-by-3 matrix): TSQR, optimized tree

2 inter-cluster communications.


Articulation of TSQR with QCG-OMPI

ScaLAPACK-based TSQR (see the reduction sketch after this slide):
- each domain is factorized with a ScaLAPACK call;
- the reduction is performed between pairs of communicators;
- the number of domains per cluster may vary from 1 to 64 (the number of cores per cluster).

QCG JobProfile:
- processes are split into groups of equivalent computing power;
- good network connectivity inside each group (low latency, high bandwidth);
- lower network connectivity between the groups.

This is the classical cluster-of-clusters approach (with a constraint on the relative sizes of the clusters to facilitate load balancing).
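A hedged mpi4py sketch of the inter-domain reduction, with one rank standing in for one domain; a production run would factorize each domain with a ScaLAPACK call over the domain's process grid rather than numpy:

```python
import numpy as np
from mpi4py import MPI

def grid_tsqr_r(A_local, comm):
    """Binary-tree TSQR across `comm`, one rank per domain (sketch)."""
    R = np.linalg.qr(A_local, mode="r")       # local factorization
    step = 1
    while step < comm.size:                   # one level per tree stage
        if comm.rank % (2 * step) == 0:       # survivor: receive and merge
            partner = comm.rank + step
            if partner < comm.size:
                R_other = comm.recv(source=partner, tag=step)
                R = np.linalg.qr(np.vstack([R, R_other]), mode="r")
        elif comm.rank % (2 * step) == step:  # sender: ship R and drop out
            comm.send(R, dest=comm.rank - step, tag=step)
        step *= 2
    return R if comm.rank == 0 else None
```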

Communication and computation breakdown (critical path)

Computing R:
                  # inter-cluster msgs    inter-cluster volume    # FLOPs
ScaLAPACK QR2     2N log2(C)              log2(C) N^2/2           (2MN^2 - (2/3)N^3)/C
TSQR              log2(C)                 log2(C) N^2/2           (2MN^2 - (2/3)N^3)/C + (2/3) log2(C) N^3

Computing Q and R (on C clusters):
                  # inter-cluster msgs    inter-cluster volume    # FLOPs
ScaLAPACK QR2     4N log2(C)              2 log2(C) N^2/2         (4MN^2 - (4/3)N^3)/C
TSQR              2 log2(C)               2 log2(C) N^2/2         (4MN^2 - (4/3)N^3)/C + (4/3) log2(C) N^3

C is the number of clusters; one domain per cluster is assumed in these tables.

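As a worked instance of the first table (the parameter values are illustrative): with C = 4 clusters and N = 512 columns, computing R costs ScaLAPACK QR2 2N log2(C) = 2048 inter-cluster messages on the critical path, against log2(C) = 2 for TSQR, for the same communicated volume.

```python
from math import log2

def critical_path_msgs(N, C):
    """Inter-cluster message counts for computing R (first table above)."""
    return {"ScaLAPACK QR2": 2 * N * log2(C), "TSQR": log2(C)}

print(critical_path_msgs(N=512, C=4))  # {'ScaLAPACK QR2': 2048.0, 'TSQR': 2.0}
```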

3. Experiments

Experimental environment: Grid'5000

- Four clusters (Orsay, Toulouse, Bordeaux, Sophia-Antipolis), 32 nodes per cluster, 2 cores per node.
- CPUs range from AMD Opteron 246 (2 GHz, 1 MB L2 cache) up to AMD Opteron 2218 (2.6 GHz, 2 MB L2 cache).
- Linux 2.6.30, ScaLAPACK 1.8.0, GotoBLAS 1.26.
- 256 cores in total (theoretical peak 2048 Gflop/s; dgemm upper bound 940 Gflop/s).

Network latency (ms):
            Orsay   Toulouse   Bordeaux   Sophia
Orsay       0.07    7.97       6.98       6.12
Toulouse            0.03       9.03       8.18
Bordeaux                       0.05       7.18
Sophia                                    0.06

Throughput (Mb/s):
            Orsay   Toulouse   Bordeaux   Sophia
Orsay       890     78         90         102
Toulouse            890        77         90
Bordeaux                       890        83
Sophia                                    890
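Plugging the measured network into a rough latency/bandwidth cost model shows why the message count dominates on this platform; the sketch below is illustrative only (64-bit reals, message counts and volume taken from the breakdown tables above, round inter-site figures of 8 ms and 90 Mb/s):

```python
from math import log2

LAT_S = 8e-3     # ~8 ms inter-site latency (latency table above)
BW_BITS = 90e6   # ~90 Mb/s inter-site throughput (throughput table above)

def critical_path_time(msgs, vol_doubles):
    """Alpha-beta model: each message pays latency, volume pays bandwidth."""
    return msgs * LAT_S + vol_doubles * 64 / BW_BITS

N, C = 512, 4
vol = log2(C) * N**2 / 2                        # doubles moved (both algorithms)
print(critical_path_time(2 * N * log2(C), vol))  # ScaLAPACK QR2: ~16.6 s
print(critical_path_time(log2(C), vol))          # TSQR: ~0.2 s
```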

Experiments: ScaLAPACK performance

ScaLAPACK, N = 64: Gflop/s versus number of rows M (1e5 to 1e8), for 4 sites (128 nodes), 2 sites (64 nodes), 1 site (32 nodes). [Figure; vertical axis up to 35 Gflop/s.]

ScaLAPACK, N = 128: same axes and configurations. [Figure; vertical axis up to 60 Gflop/s.]

ScaLAPACK, N = 256: Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 60 Gflop/s.]

ScaLAPACK, N = 512: Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 90 Gflop/s.]

Experiments: TSQR performance

TSQR, N = 64, one cluster, effect of the number of domains (1 to 64), for M = 65 536, 131 072, 1 048 576, 8 388 608. [Figure; vertical axis up to 40 Gflop/s.]

TSQR, N = 64, all four clusters, effect of the number of domains per cluster (1 to 64), for M = 131 072, 524 288, 4 194 304, 33 554 432. [Figure; vertical axis up to 100 Gflop/s.]

TSQR, N = 512, one cluster, effect of the number of domains (1 to 64), for M = 65 536, 131 072, 1 048 576, 2 097 152. [Figure; vertical axis up to 100 Gflop/s.]

TSQR, N = 512, all four clusters, effect of the number of domains per cluster (1 to 64), for M = 262 144, 524 288, 2 097 152, 8 388 608. [Figure; vertical axis up to 350 Gflop/s.]

TSQR in its optimum configuration, N = 64: Gflop/s versus M (1e5 to 1e8), for 4 sites (128 nodes), 2 sites (64 nodes), 1 site (32 nodes). [Figure; vertical axis up to 100 Gflop/s.]

TSQR in its optimum configuration, N = 128: same axes and configurations. [Figure; vertical axis up to 140 Gflop/s.]

TSQR in its optimum configuration, N = 256: Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 180 Gflop/s.]

TSQR in its optimum configuration, N = 512: Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 300 Gflop/s.]

Experiments: TSQR vs ScaLAPACK performance

TSQR vs ScaLAPACK, N = 64: best configuration of each, Gflop/s versus M (1e5 to 1e8). [Figure; vertical axis up to 90 Gflop/s.]

TSQR vs ScaLAPACK, N = 128: best configuration of each, Gflop/s versus M (1e5 to 1e8). [Figure; vertical axis up to 120 Gflop/s.]

TSQR vs ScaLAPACK, N = 256: best configuration of each, Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 180 Gflop/s.]

TSQR vs ScaLAPACK, N = 512: best configuration of each, Gflop/s versus M (1e5 to 1e7). [Figure; vertical axis up to 300 Gflop/s.]

4. Conclusion and future work

Conclusion

Can we speed up dense linear algebra applications using a computational grid? Yes, at least for applications based on the QR factorization of tall and skinny matrices.

Future directions

- What about square matrices (CAQR)?
- LU and Cholesky factorizations?
- Can we benefit from recursive kernels?

Current investigations: comparison of QR algorithms (Cholesky QR, Householder A v0/v1/v2, PDGEQRF, CGS v0/v1), GFlops versus matrix height M.

- N = 64, 1 cluster: M up to 3.5e7. [Figure; vertical axis up to 80 GFlops.]
- N = 64, 2 clusters: M up to 7e7. [Figure; vertical axis up to 140 GFlops.]
- N = 64, 4 clusters: M up to 1.4e8. [Figure; vertical axis up to 300 GFlops.]
- N = 512, 1 cluster: M up to 4.5e6. [Figure; vertical axis up to 140 GFlops.]
- N = 512, 2 clusters: M up to 9e6. [Figure; vertical axis up to 250 GFlops.]
- N = 512, 4 clusters: M up to 1.8e7. [Figure; vertical axis up to 500 GFlops.]

Thanks

Questions?