Lightweight Superscalar Task Execution in Distributed Memory


Lightweight Superscalar Task Execution in Distributed Memory
Asim YarKhan (1) and Jack Dongarra (1,2,3)
(1) Innovative Computing Lab, University of Tennessee, Knoxville, TN
(2) Oak Ridge National Lab, Oak Ridge, TN
(3) University of Manchester, Manchester, UK
MTAGS 2014: 7th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers
New Orleans, Louisiana, Nov 16, 2014

Architecture and Missed Opportunities
[Diagram: distributed-memory nodes, each node a shared-memory multicore chip]
- Parallel programming is difficult (still, again, yet).
- Coding is via Pthreads, MPI, OpenMP, UPC, etc.
- The user handles the complexities of coding, scheduling, execution, etc.
- Efficient and scalable programming is hard: we often get undesired synchronization points, and fork-join wastes cores and reduces performance.
- We need to access more of the available parallelism; with larger multicore architectures, more inactive cores means more waste.

Productivity, Efficiency, Scalability
[Figure: task DAG of a tile Cholesky factorization (POTRF), levels labeled with task counts]
Productivity in programming:
- Have a simple, serial API for programming.
- The runtime environment handles all the details.
Efficiency and scalability:
- Tasks have data dependencies, and a task can execute as soon as its data is ready (asynchronously).
- This yields a task DAG (directed acyclic graph): nodes are tasks; edges are data dependencies.
- The runtime uses the available cores in shared memory and transfers data as required in distributed memory.

Related Projects
- PaRSEC [UTK]: Framework for distributed-memory task execution. Requires a specialized parameterized compact task-graph description; parameterized task graphs are hard to express. Very high performance is achievable. Implements DPLASMA.
- SMPSs [Barcelona]: Shared memory. Compiler-pragma based; runtime system with data locality and task stealing, with an emphasis on data replication. MPI is available via explicit wrappers.
- StarPU [INRIA]: Shared and distributed memory. Library-API based, with an emphasis on heterogeneous scheduling (GPUs) and smart data management; similar to this work.
- Others: Charm++, Jade, Cilk, OpenMP, SuperMatrix, FLAME, ScaLAPACK, ...

Driving Applications: Tile Linear Algebra Algorithms
Block algorithms:
- Standard linear algebra libraries (LAPACK, ScaLAPACK) gain parallelism from BLAS-3 operations interspersed with less parallel operations.
- Execution is fork-join (block synchronous parallel).
Tile algorithms:
- Rewrite the algorithms as tasks acting on data tiles.
- Tasks using data => data dependencies => DAG.
- We want to execute DAGs asynchronously and in parallel => a runtime.
- That runtime is QUARK-D: QUeuing and Runtime for Kernels for Distributed Memory.
[Figure: fork-join panel/update steps (step 1 through step 4) of a block factorization vs. the task DAG of a tile factorization (POTRF)]

Tile QR Factorization Algorithm

    for k = 0 .. TILES-1
        geqrt( A(k,k)^rw, T(k,k)^w )
        for n = k+1 .. TILES-1
            unmqr( A(k,k)^r (lower), T(k,k)^r, A(k,n)^rw )
        for m = k+1 .. TILES-1
            tsqrt( A(k,k)^rw (upper), A(m,k)^rw, T(m,k)^rw )
            for n = k+1 .. TILES-1
                tsmqr( A(m,k)^r, T(m,k)^r, A(k,n)^rw, A(m,n)^rw )

The superscripts r, w, and rw mark each tile argument's access mode (read, write, read-write).
[Figure: the list of tasks (GEQRT, UNMQR, TSQRT, TSMQR) as they are generated by the loops]

Tile QR Factorization: Data Dependencies

    F0  geqrt( A(0,0)^rw, T(0,0)^w )
    F1  unmqr( A(0,0)^r, T(0,0)^r, A(0,1)^rw )
    F2  unmqr( A(0,0)^r, T(0,0)^r, A(0,2)^rw )
    F3  tsqrt( A(0,0)^rw, A(1,0)^rw, T(1,0)^w )
    F4  tsmqr( A(0,1)^rw, A(1,1)^rw, A(1,0)^r, T(1,0)^r )
    F5  tsmqr( A(0,2)^rw, A(1,2)^rw, A(1,0)^r, T(1,0)^r )
    F6  tsqrt( A(0,0)^rw, A(2,0)^rw, T(2,0)^w )
    F7  tsmqr( A(0,1)^rw, A(2,1)^rw, A(2,0)^r, T(2,0)^r )
    F8  tsmqr( A(0,2)^rw, A(2,2)^rw, A(2,0)^r, T(2,0)^r )
    F9  geqrt( A(1,1)^rw, T(1,1)^w )
    F10 unmqr( A(1,1)^r, T(1,1)^w, A(1,2)^rw )
    F11 tsqrt( A(1,1)^rw, A(2,1)^rw, T(2,1)^w )
    F12 tsmqr( A(1,2)^rw, A(2,2)^rw, A(2,1)^r, T(2,1)^r )
    F13 geqrt( A(2,2)^rw, T(2,2)^w )

Data dependencies arising from tasks F0-F5 of the QR factorization:

    A(0,0): F0 rw : F1 r : F2 r : F3 rw
    A(0,1): F1 rw : F4 rw
    A(0,2): F2 rw : F5 rw
    A(1,0): F3 rw : F4 r : F5 r
    A(1,1): F4 rw
    A(1,2): F5 rw
    A(2,0):
    A(2,1):
    A(2,2):

Tile QR Factorization: Dependencies to Execution

First step in execution: run task (function) F0.

    F0 geqrt( A(0,0)^rw, T(0,0)^w )
    F1 unmqr( A(0,0)^r, T(0,0)^r, A(0,1)^rw )
    F2 unmqr( A(0,0)^r, T(0,0)^r, A(0,2)^rw )
    F3 tsqrt( A(0,0)^rw, A(1,0)^rw, T(1,0)^w )
    F4 tsmqr( A(0,1)^rw, A(1,1)^rw, A(1,0)^r, T(1,0)^r )
    F5 tsmqr( A(0,2)^rw, A(1,2)^rw, A(1,0)^r, T(1,0)^r )

    A(0,0): F0 rw : F1 r : F2 r : F3 rw
    A(0,1): F1 rw : F4 rw
    A(0,2): F2 rw : F5 rw
    A(1,0): F3 rw : F4 r : F5 r
    A(1,1): F4 rw
    A(1,2): F5 rw
    T(0,0): F0 w : F1 r : F2 r
    T(1,0): F3 w : F4 r : F5 r

Second step in execution: remove F0; now F1 and F2 are ready.

    F1 unmqr( A(0,0)^r, T(0,0)^r, A(0,1)^rw )
    F2 unmqr( A(0,0)^r, T(0,0)^r, A(0,2)^rw )
    F3 tsqrt( A(0,0)^rw, A(1,0)^rw, T(1,0)^w )
    F4 tsmqr( A(0,1)^rw, A(1,1)^rw, A(1,0)^r, T(1,0)^r )
    F5 tsmqr( A(0,2)^rw, A(1,2)^rw, A(1,0)^r, T(1,0)^r )

    A(0,0): F1 r : F2 r : F3 rw
    A(0,1): F1 rw : F4 rw
    A(0,2): F2 rw : F5 rw
    A(1,0): F3 rw : F4 r : F5 r
    A(1,1): F4 rw
    A(1,2): F5 rw
    T(0,0): F1 r : F2 r
    T(1,0): F3 w : F4 r : F5 r

A minimal sketch of this bookkeeping follows below.
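The following is a minimal, self-contained C sketch of the per-tile dependency lists and the readiness rule illustrated on the two slides above; it is an illustration under stated assumptions, not QUARK-D's actual data structures. Each tile keeps an ordered list of (task, access-mode) entries appended at insertion time; a task may touch a tile when its entry is at the head of that tile's list, or, for a read, when only other reads precede it. Removing a completed task's entries is what releases its successors (the F0 -> {F1, F2} step).

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    enum access { R, W, RW };

    typedef struct entry { int task; enum access mode; struct entry *next; } entry;
    typedef struct { entry *head; } dep_list;              /* one list per tile   */

    static void add_access(dep_list *d, int task, enum access mode) {
        entry *e = malloc(sizeof *e);
        e->task = task; e->mode = mode; e->next = NULL;
        entry **p = &d->head;
        while (*p) p = &(*p)->next;                        /* append at the tail  */
        *p = e;
    }

    static bool access_ready(const dep_list *d, int task, enum access mode) {
        for (const entry *e = d->head; e; e = e->next) {
            if (e->task == task) return true;              /* reached our entry   */
            if (mode != R || e->mode != R) return false;   /* blocked by conflict */
        }
        return false;                                      /* not registered      */
    }

    static void remove_task(dep_list *d, int task) {       /* task has completed  */
        entry **p = &d->head;
        while (*p) {
            if ((*p)->task == task) { entry *e = *p; *p = e->next; free(e); }
            else p = &(*p)->next;
        }
    }

    int main(void) {
        dep_list A00 = {0}, A01 = {0};
        add_access(&A00, 0, RW);   /* F0 geqrt(A(0,0)^rw, ...)                     */
        add_access(&A00, 1, R);    /* F1 unmqr(A(0,0)^r, ..., A(0,1)^rw)           */
        add_access(&A01, 1, RW);

        printf("F1 ready? %d\n",
               access_ready(&A00, 1, R) && access_ready(&A01, 1, RW));  /* 0      */
        remove_task(&A00, 0);                              /* F0 finishes         */
        printf("F1 ready? %d\n",
               access_ready(&A00, 1, R) && access_ready(&A01, 1, RW));  /* 1      */
        return 0;
    }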

QUARK-D API and Runtime
QUARK-D: QUeuing and Runtime for Kernels in Distributed Memory.
Simple serial task-insertion interface:

    QUARKD_Insert_Task( quark, function, task_flags,
                        a_flags, size_a, a, a_home_process, a_key,
                        b_flags, size_b, b, b_home_process, b_key,
                        ..., 0 );

The runtime manages the distributed details for the user:
- Scheduling tasks (deciding where tasks should run).
- Data dependencies and movement (local and remote).
- Transparent communication.
- No global knowledge or coordination required.
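As a concrete illustration of this interface, here is a hedged sketch of how a tile GEMM-like task might be inserted, following the argument pattern above and the flags (VALUE, INOUT, OUTPUT, LOCALITY) that appear on the TASK_dgeqrt slide further on. The INPUT flag, the CORE_dgemm_quark kernel name, and the nb parameter are illustrative assumptions, not confirmed parts of the QUARK-D API.

    void TASK_dgemm( Quark *quark, Quark_Task_Flags *task_flags, int nb,
                     double *A, int A_home, key A_key,
                     double *B, int B_home, key B_key,
                     double *C, int C_home, key C_key )
    {
        QUARKD_Insert_Task( quark, CORE_dgemm_quark, task_flags,
            VALUE,            sizeof(int),          &nb,              /* scalar by value */
            INPUT,            sizeof(double)*nb*nb,  A, A_home, A_key, /* read-only tile  */
            INPUT,            sizeof(double)*nb*nb,  B, B_home, B_key, /* read-only tile  */
            INOUT | LOCALITY, sizeof(double)*nb*nb,  C, C_home, C_key, /* read-write tile */
            0 );                                     /* terminating 0, as above           */
    }

Each data argument carries its access flags, size, pointer, home process, and key, which is what lets the runtime infer dependencies and plan remote transfers without any further user code.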

Productivity: QUARK-D QR Implementation
The code matches the pseudo-code.

    #define A(m,n)  ADDR(A), HOME(m,n), KEY(A,m,n)
    #define T(m,n)  ADDR(T), HOME(m,n), KEY(T,m,n)

    void plasma_pdgeqrf( A, T, ... ) {
        for (k = 0; k < M; k++) {
            TASK_dgeqrt( quark, ..., A(k,k), T(k,k) );
            for (n = k+1; n < N; n++)
                TASK_dormqr( quark, ..., A(k,k), T(k,k), A(k,n) );
            for (m = k+1; m < M; m++) {
                TASK_dtsqrt( quark, ..., A(k,k), A(m,k), T(m,k) );
                for (n = k+1; n < N; n++)
                    TASK_dtsmqr( quark, ..., A(k,n), A(m,n), A(m,k), T(m,k) );
            }
        }
    }

Corresponding pseudo-code:

    for k = 0 .. TILES-1
        geqrt( A(k,k)^rw, T(k,k)^w )
        for n = k+1 .. TILES-1
            unmqr( A(k,k)^r (lower), T(k,k)^r, A(k,n)^rw )
        for m = k+1 .. TILES-1
            tsqrt( A(k,k)^rw (upper), A(m,k)^rw, T(m,k)^rw )
            for n = k+1 .. TILES-1
                tsmqr( A(m,k)^r, T(m,k)^r, A(k,n)^rw, A(m,n)^rw )

Productivity: QUARK-D QR Implementation (continued)
The task is inserted into the runtime and held until its data is ready.

    void TASK_dgeqrt( Quark *quark, ..., int m, int n,
                      double *A, int A_home, key A_key,
                      double *T, int T_home, key T_key )
    {
        QUARKD_Insert_Task( quark, CORE_dgeqrt_quark, ...,
            VALUE,            sizeof(int), &m,
            VALUE,            sizeof(int), &n,
            INOUT | LOCALITY, sizeof(A),   A, A_home, A_key,
            OUTPUT,           sizeof(T),   T, T_home, T_key,
            ..., 0 );
    }

When the task is eventually executed, the dependencies are unpacked and the serial core routine is called.

    void CORE_dgeqrt_quark( Quark *quark )
    {
        int m, n, ib, lda, ldt;
        double *A, *T, *TAU, *WORK;
        quark_unpack_args_9( quark, m, n, ib, A, lda, T, ldt, TAU, WORK );
        CORE_dgeqrt( m, n, ib, A, lda, T, ldt, TAU, WORK );
    }

Distributed Memory Algorithm
This pseudocode manages the distributed details for the user.

    // running at each distributed node
    for each task T as it is inserted
        // determine P_exe based on the dependency to be kept local
        P_exe = process that will run task T
        for each dependency A_i in T
            if ( I am P_exe ) && ( A_i is invalid here )
                insert receive task( A_i^rw )
            else if ( P_exe has invalid A_i ) && ( I own A_i )
                insert send task( A_i^r )
            // track who is the current owner, who has valid copies
            update dependency tracking
        if ( I am P_exe )
            insert task T into the shared-memory runtime

A small sketch of the coherency bookkeeping behind this pseudocode follows below.
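The following minimal C sketch (with hypothetical names, not QUARK-D internals) illustrates the bookkeeping each node performs while replaying the pseudocode above: every node tracks, per tile, the current owner and a bitmask of processes holding a valid copy. Because every node inserts the same task stream and evaluates the same owner function, the tables stay consistent without control messages.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        int      owner;      /* process that currently owns the tile            */
        uint64_t valid;      /* bit p set => process p holds a valid copy       */
    } coherency;

    static bool has_valid(const coherency *c, int p) { return (c->valid >> p) & 1u; }

    /* stand-ins for inserting the communication tasks named in the pseudocode */
    static void insert_recv_task(int tile) { printf("recv tile %d\n", tile); }
    static void insert_send_task(int tile) { printf("send tile %d\n", tile); }

    /* Run by every process for every dependency of every task it inserts.     */
    static void process_dependency(coherency *c, int tile, int me, int p_exe,
                                   bool is_write)
    {
        if (me == p_exe && !has_valid(c, p_exe))
            insert_recv_task(tile);                 /* I will run T, need data  */
        else if (!has_valid(c, p_exe) && c->owner == me)
            insert_send_task(tile);                 /* I own the data, ship it  */

        if (is_write) { c->owner = p_exe; c->valid = 1ull << p_exe; } /* new owner */
        else          { c->valid |= 1ull << p_exe; }                  /* new copy  */
    }

    int main(void) {
        /* each process keeps its own table; both start out identical           */
        coherency on_P0 = { .owner = 0, .valid = 1u };   /* tile A(1,0) on P0    */
        coherency on_P1 = on_P0;
        /* A read of A(1,0) by a task mapped to P1: P0 inserts a send task, P1
           inserts a receive task, and both tables end up in the same state.    */
        process_dependency(&on_P0, 10, /*me*/0, /*p_exe*/1, false);
        process_dependency(&on_P1, 10, /*me*/1, /*p_exe*/1, false);
        return 0;
    }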

QUARK-D: Running the Distributed Memory Algorithm
[Figure: the tiles A(0,0) through A(2,2) and the per-process task lists (a) P0, (b) P1, (c) P2, covering the tasks F0-F13 listed on the earlier data-dependency slide]
Execution of a small QR factorization (DGEQRF): three processes (P0, P1, P2) run the factorization of a 3x3 tile matrix using a 1x3 process grid. Note that two of the kernels have locality on their second RW parameter.

QUARK-D: QR DAG
[Figure: DAG of the distributed-memory QR factorization]
QUARK-D's principles of operation: scheduling the DAG of the distributed-memory QR factorization. Three distributed-memory processes run the factorization algorithm on a 3x3 tile matrix. One multi-threaded process runs all the blue tasks, another multi-threaded process runs the green tasks, and a third runs the purple tasks. Colored links show local task dependencies; black arrows show inter-process communications.

QUARK-D: Key Developments
Distributed scheduling:
- A function tells us which process is going to run a task, usually based on the data distribution (2D block cyclic), but it can be any function that evaluates the same on all processes (see the sketch after this slide).
- Execution within a multi-threaded process is completely dynamic.
Decentralized data coherency protocol:
- Processes coordinate the data movement without any control messages.
- Coordination is enabled by a data coherency protocol in which each process knows who is the current owner of a piece of data, and which processes have valid copies of that data.
Asynchronous data transfer:
- Data movement is initiated by tasks; the message passing then continues asynchronously without blocking other tasks.
- The data movement protocol is an eager protocol initiated by a send-data task. The receive-data task is activated by the message-passing engine and can get the data asynchronously (from temporary storage if necessary).
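A sketch of the kind of owner function described above: a 2D block-cyclic mapping from tile coordinates to a process rank. Because it depends only on (m, n) and the grid shape, every process evaluates it identically, so no coordination is needed to agree on where a task runs. The function name and the P x Q grid parameters are illustrative, not QUARK-D's actual API.

    /* Owner of tile (m, n) on a P x Q process grid, 2D block cyclic,
       with the grid enumerated in row-major order. */
    int tile_owner(int m, int n, int P, int Q)
    {
        return (m % P) * Q + (n % Q);
    }

For example, on a 2x2 grid, tile (2,3) maps to rank (2%2)*2 + (3%2) = 1. Any function with this replicated-evaluation property could be substituted, e.g. one that keeps a task next to its RW data.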

QUARK-D: QR Trace
Figure: Trace of a QR factorization of a matrix of 16x16 tiles on 4 (2x2) distributed-memory nodes using 4 computational threads per node. An independent MPI communication thread is also maintained. Color coding: MPI communication (pink); the three factorization kernels (green, yellow, cyan); UNMQR (red).

QUARK-D: QR Weak Scaling: Small Cluster
[Plot: GFlop/s vs. number of cores (8 cores/node), up to about 128 cores on distributed-memory nodes (2x4-core 2.27 GHz Xeon) of the "dancer" cluster; curves: theoretical peak performance, PaRSEC (nb=240, ib=48), QUARK-D (nb=260, ib=52), ScaLAPACK (nb=128)]
Figure: Weak-scaling performance of QR factorization on a small cluster, factorizing a matrix (5000x5000 per core) on up to 16 distributed-memory nodes with 8 cores per node. Comparing QUARK-D, PaRSEC, and ScaLAPACK (MKL).

QUARK-D: QR Weak Scaling: Large Cluster
[Plot: "QR factorization with weak scaling (5000x5000/core), 100 nodes with 2x6 2.6 GHz Opteron Istanbul cores/node" on the "kraken" system; GFlop/s vs. number of cores (12 cores/node), up to 1200 cores; curves: theoretical peak performance, peak performance (single core), PaRSEC (NB=260), ScaLAPACK (one process per core, NB=180), QUARK-D (NB=468)]
Figure: Weak-scaling performance of QR factorization (DGEQRF) of a matrix (5000x5000 per core) on 1200 cores (100 distributed-memory nodes with 12 cores per node). Comparing QUARK-D, PaRSEC, and ScaLAPACK (libsci).

QUARK-D: Cholesky Weak Scaling: Small Cluster
[Plot: GFlop/s vs. number of cores (8 cores/node) on 16 distributed-memory nodes (2x4-core 2.27 GHz Xeon) of the "dancer" cluster; curves: theoretical peak performance, PaRSEC/DPLASMA (nb=240), QUARK-D (nb=312), ScaLAPACK (nb=96)]
Figure: Weak-scaling performance of Cholesky factorization (DPOTRF) of a matrix (5000x5000 per core) on 16 distributed-memory nodes with 8 cores per node. Comparing QUARK-D, PaRSEC, and ScaLAPACK (MKL).

QUARK-D: Cholesky Weak Scaling: Large Cluster
[Plot: "Cholesky factorization with weak scaling (5000x5000/core), 100 nodes with 2x6 2.6 GHz Opteron Istanbul cores/node" on the "kraken" system; GFlop/s vs. number of cores (12 cores/node), up to 1200 cores; curves: theoretical peak performance, peak performance (single core), PaRSEC (NB=288), QUARK-D (NB=468), ScaLAPACK (NB=180)]
Figure: Weak-scaling performance of Cholesky factorization (DPOTRF) of a matrix (7000x7000 per core) on 1200 cores (100 distributed-memory nodes with 12 cores per node). Comparing QUARK-D, PaRSEC, and ScaLAPACK (libsci).

DAG Composition: Cholesky Inversion
- Cholesky inversion consists of three stages: POTRF, TRTRI, LAUUM.
- DAG composition can compress DAGs substantially.
[Figure: composed DAG of the POTRF, TRTRI, and LAUUM stages]
A sketch of composing the three stages by back-to-back task insertion follows below.
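The following hedged sketch illustrates the composition idea: the three stages of the Cholesky inversion are submitted back to back into the same runtime instance, with no barrier between them, so their DAGs merge and tasks from TRTRI or LAUUM can start as soon as the tiles they need are produced. The plasma_pd* driver names follow the pdgeqrf example earlier, and the desc type and final barrier call are illustrative placeholders rather than the exact QUARK-D entry points.

    void cholesky_inversion( Quark *quark, desc A /* tiled SPD matrix */ )
    {
        plasma_pdpotrf( quark, A );  /* A = L * L^T            (inserts POTRF tasks) */
        plasma_pdtrtri( quark, A );  /* L := L^{-1}            (inserts TRTRI tasks) */
        plasma_pdlauum( quark, A );  /* A^{-1} = L^{-T} L^{-1} (inserts LAUUM tasks) */
        /* No synchronization between the three calls: the data dependencies on
           the tiles of A stitch the three DAGs into the single composed DAG
           shown above.  A wait/barrier is needed only at the very end.          */
        QUARK_Barrier( quark );
    }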

QUARK-D: Composing Cholesky Inversion
Figure: Trace of the distributed-memory Cholesky inversion of a matrix, with the three composed DAGs (POTRF, TRTRI, LAUUM).

QUARK-D Summary
- Designed and implemented a runtime system for task-based applications on distributed-memory architectures.
- Uses a serial task-insertion interface with automatic data-dependency inference.
- No global coordination for task scheduling.
- A distributed data coherency protocol manages copies of data.
- A fast communication engine transfers data asynchronously.
- Focus on productivity, scalability, and performance.

The End