Dynamic Scheduling within MAGMA

1 Dynamic Scheduling within MAGMA. Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov. April 5, 2012. Innovative Computing Laboratory (ICL), University of Tennessee, Knoxville

2 ICL, Knoxville - Tennessee 2 April 5, 2012 M. Faverge - MAGMA

3 Outline
1. Introduction
2. MAGMA
3. Tile Algorithms
4. Tile Algorithms on Top of StarPU
   - From Multicore to Hybrid Architectures
   - Scheduling policy
   - Adapt the kernels to GPUs
   - Tuning: how to choose a good (IB, NB) pair?
5. Conclusion and Future Work

4 Hardware Trends
Clock frequency scaling has been replaced by scaling cores per chip: scale the number of cores instead of the clock speed. With multicore and hybrid designs, a hardware issue became a software issue.
[Figure: transistors (in thousands), frequency (MHz) and cores, on a log scale from 1.E+00 to 1.E+07. Figure from Kathy Yelick, "Ten Ways to Waste a Parallel Computer" (Berkeley ParLab). Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.]

5 Future Systems
[Figure: number of GPU-accelerated systems in the Top 500.]
- Most likely a hybrid design: multicore + GPU accelerators
- Today: accelerators are attached
- Future: accelerators are integrated (Intel's MIC Knights Corner, AMD's Fusion, Nvidia's Project Denver)

6 Software generations
Software and algorithms follow hardware evolution in time:
- 70s: LINPACK, vector operations: Level-1 BLAS operations
- 80s: LAPACK, blocked, cache-friendly: Level-3 BLAS operations
- 90s: ScaLAPACK, distributed memory: PBLAS, message passing
- 00s: PLASMA, many-core friendly: DAG scheduler, block data layout, some extra kernels; MAGMA, GPU: GPU BLAS
- 10s: What about many cores AND many GPUs AND distributed memory?

11 Current projects
- DAGuE (Directed Acyclic Graph Unified Environment): targets multi-GPU + multicore nodes in distributed memory; the scheduler relies on a parameterized DAG to represent the dependencies; DPLASMA and DSPARSE libraries.
- MORSE (Matrices Over Runtime Systems at Exascale): targets multi-GPU + multicore nodes in distributed memory; interface to use PLASMA algorithms on top of an external scheduler (StarPU).

14 MAGMA 1.1
Linear algebra library for GPUs:
- LAPACK column-wise layout
- LAPACK-like C and Fortran interfaces
- CPU and GPU interfaces

15 MAGMA 1.1
- MAGMA BLAS: GPU-only kernels; same interface as CUBLAS; improves some of the CUBLAS routines
- MAGMA LAPACK: mostly GPU computations; hybrid kernels (1 GPU + CPU); BLAS multithreaded or not; CPU and GPU interfaces
[Figure: Gflop/s vs. matrix size for the symmetric matrix-vector multiply: MAGMA ssymv, CUBLAS ssymv, MAGMA dsymv, CUBLAS dsymv.]

16 Hybrid kernels in MAGMA
- BLAS-2 / panel factorization on the CPU
- BLAS-3 / update of the trailing matrix on the GPU
- Look-ahead
[Figure: panel, trailing matrix and look-ahead regions of the matrix, labeled A = QA.]

22 Tile Algorithms (PLASMA)
- Parallelism is brought to the fore
- May require the redesign of linear algebra algorithms
- Remove unnecessary synchronization points
- DAG execution, where nodes represent tasks and edges define dependencies between them
- Tile data layout
- Dynamic runtime system environment

23 Data Layout
- LAPACK: column-major format
- PLASMA: tile format; improves cache locality and simplifies data transfers
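
To make the difference between the two layouts concrete, here is a minimal sketch of the index arithmetic. It assumes square nb x nb tiles, each stored contiguously in column-major order, with the tiles themselves ordered column by column, and a matrix dimension that is a multiple of nb. The function names are illustrative only and are not PLASMA's API; PLASMA's actual tile layout may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) in LAPACK column-major storage
 * with leading dimension lda. */
size_t colmajor_offset(size_t i, size_t j, size_t lda)
{
    return i + j * lda;
}

/* Offset of element (i, j) in an assumed tile layout:
 * nb x nb tiles, each tile contiguous and column-major,
 * tiles ordered column by column. m is the number of matrix
 * rows and must be a multiple of nb (no partial tiles). */
size_t tile_offset(size_t i, size_t j, size_t m, size_t nb)
{
    assert(nb > 0 && m % nb == 0);
    size_t mt = m / nb;      /* number of tile rows            */
    size_t ti = i / nb;      /* tile row containing (i, j)     */
    size_t tj = j / nb;      /* tile column containing (i, j)  */
    size_t ii = i % nb;      /* row inside the tile            */
    size_t jj = j % nb;      /* column inside the tile         */
    return (ti + tj * mt) * nb * nb + ii + jj * nb;
}
```

With this layout every tile occupies one contiguous block of nb * nb elements, so a whole tile can be moved with a single copy, whereas in column-major storage the same tile is spread over nb separate column segments.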

24 Tile QR algorithm
First panel factorization and corresponding updates.
[Figure: DAG for a 4x4 tile matrix, 30 tasks: 4 GEQRT, 6 ORMQR, 6 TSQRT and 14 TSMQR nodes.]

25 QR - 32x4 tile matrix
Try to enlarge the DAG.

26 CAQR - 32x4 tile matrix - 16 domains
Try to enlarge the DAG = new algorithm. How to schedule this problem efficiently?

27 Dynamic Scheduling
- Conceptually similar to out-of-order processor scheduling
- Dynamic runtime DAG scheduler
- Out-of-order execution flow of fine-grained tasks
- Tasks are scheduled as soon as their dependencies are satisfied
- Producer-consumer
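
The rule "run a task as soon as its dependencies are satisfied" can be sketched with one pending-dependency counter per task. This is a sequential toy, not the scheduler of QUARK or StarPU: the edge-list representation, the MAX_TASKS limit and the name run_dag are assumptions made for illustration, and "executing" a task only records its index.

```c
#include <assert.h>

#define MAX_TASKS 64

/* Edge e means: task edge_to[e] depends on task edge_from[e].
 * Tasks are released as soon as all their predecessors are done.
 * The execution order is written to order[]; the return value is
 * the number of tasks executed (less than ntasks if there is a cycle). */
int run_dag(int ntasks, int nedges,
            const int *edge_from, const int *edge_to, int *order)
{
    int pending[MAX_TASKS];  /* unsatisfied dependencies per task */
    int ready[MAX_TASKS];    /* FIFO queue of ready tasks         */
    int head = 0, tail = 0, done = 0;

    assert(ntasks <= MAX_TASKS);
    for (int t = 0; t < ntasks; t++)
        pending[t] = 0;
    for (int e = 0; e < nedges; e++)
        pending[edge_to[e]]++;
    for (int t = 0; t < ntasks; t++)
        if (pending[t] == 0)
            ready[tail++] = t;

    while (head < tail) {
        int t = ready[head++];
        order[done++] = t;               /* "execute" the task */
        /* Producer finished: release consumers that are now ready. */
        for (int e = 0; e < nedges; e++)
            if (edge_from[e] == t && --pending[edge_to[e]] == 0)
                ready[tail++] = edge_to[e];
    }
    return done;
}
```

In a real runtime the ready queue is shared by several workers, so independent tasks run concurrently and out of submission order; the counters are what keep the result correct.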

28 Code example (QR)

    for (k = 0; k < min(MT, NT); k++) {
        /* factor the diagonal tile */
        inserttask(zgeqrt, Akk, INOUT);
        /* apply its reflectors to the tiles on its right */
        for (n = k+1; n < NT; n++)
            inserttask(zunmqr, Akk, INPUT, Akn, INOUT);
        for (m = k+1; m < MT; m++) {
            /* eliminate tile (m, k) against the diagonal tile */
            inserttask(ztsqrt, Akk, INOUT, Amk, INOUT);
            /* update the two affected tile rows of the trailing matrix */
            for (n = k+1; n < NT; n++)
                inserttask(ztsmqr, Amk, INPUT, Akn, INOUT, Amn, INOUT);
        }
    }
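
As a sanity check on the loop nest, the sketch below counts how many tasks of each kind it submits for an MT x NT tile matrix. It assumes the standard tile QR nesting, with the ztsmqr loop inside the ztsqrt loop. The struct and function names are illustrative, not part of PLASMA or MAGMA.

```c
#include <assert.h>

/* Number of tasks of each kind submitted by the tile QR loop nest. */
struct qr_task_counts {
    long geqrt;
    long unmqr;
    long tsqrt;
    long tsmqr;
};

struct qr_task_counts count_tile_qr_tasks(int MT, int NT)
{
    struct qr_task_counts c = {0, 0, 0, 0};
    assert(MT >= 0 && NT >= 0);
    int K = (MT < NT) ? MT : NT;

    for (int k = 0; k < K; k++) {
        c.geqrt++;                            /* diagonal tile        */
        for (int n = k + 1; n < NT; n++)
            c.unmqr++;                        /* tiles right of it    */
        for (int m = k + 1; m < MT; m++) {
            c.tsqrt++;                        /* tiles below it       */
            for (int n = k + 1; n < NT; n++)
                c.tsmqr++;                    /* trailing-matrix tile */
        }
    }
    return c;
}
```

For a 4x4 tile matrix this gives 4 GEQRT, 6 UNMQR, 6 TSQRT and 14 TSMQR tasks, 30 in total, which matches the node labels of the 4x4 DAG on the tile QR slide (where the UNMQR kernel appears under its real-arithmetic name ORMQR).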

31 Tile algorithms for Hybrid architectures
Pros:
- No constraint on computation locality in the DAG description
- No explicit MPI communications
- Easy composition between steps of algorithms
- No need to change the algorithm to take new architectures into account
Cons:
- Lots of algorithms cannot be easily expressed as a DAG
- Tuning may be required to get good performance
- May require identical kernels on CPUs and GPUs (same input/output)
- Granularity cannot change easily

33 Why StarPU?
Pros:
- Memory management
- Task submission system similar to QUARK
- Cost models that analyze kernel efficiency in real time
- Several scheduling strategies available
- Using GPUs is straightforward
- No specific compiler
Cons:
- No MPI support (before v1.0)
- WAR (write-after-read) dependencies require copies made by the user
- NUMA support is not optimized yet
- Too many scheduling strategies

34 From Multicore to Hybrid Architectures. Figure: PLASMA Architecture

35 From Multicore to Hybrid Architectures. Figure: MAGMA Architecture

36 How does it work?
- One worker per unit: a CPU, or a CPU/GPU pair
- MAGMA kernels require a GPU AND a CPU
- Streams are not used
- Simple replacement of quark_insert_task by morse_insert_task
- CPU and GPU kernels must take the same inputs and return the same outputs
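
The last constraint can be illustrated with a task descriptor that holds one implementation per kind of processing unit, both with the same signature. This is a hypothetical sketch: tile_task, run_task and the two scaling kernels are invented for illustration and are not the MORSE or StarPU API, and the "GPU" kernel here is plain C standing in for code that would launch device work.

```c
#include <assert.h>
#include <stddef.h>

/* Both implementations share one signature: same inputs, same outputs. */
typedef void (*tile_kernel)(double *tile, size_t n, double alpha);

enum unit_kind { UNIT_CPU, UNIT_GPU };

/* Hypothetical task descriptor: one kernel per kind of unit. */
struct tile_task {
    tile_kernel on_cpu;
    tile_kernel on_gpu;
};

/* Stand-in CPU kernel: scales a tile in place. */
void scale_cpu(double *tile, size_t n, double alpha)
{
    for (size_t i = 0; i < n; i++)
        tile[i] *= alpha;
}

/* Stand-in "GPU" kernel: plain C in this sketch (it walks the tile
 * backwards); a real one would launch device code instead. */
void scale_gpu(double *tile, size_t n, double alpha)
{
    for (size_t i = n; i-- > 0; )
        tile[i] *= alpha;
}

/* Run the task on whichever unit the scheduler picked. */
void run_task(const struct tile_task *t, enum unit_kind where,
              double *tile, size_t n, double alpha)
{
    tile_kernel k = (where == UNIT_GPU) ? t->on_gpu : t->on_cpu;
    assert(k != NULL);
    k(tile, n, alpha);
}
```

Because the two kernels are interchangeable from the caller's point of view, the scheduler is free to place each task on either kind of unit without the algorithm changing.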

38 Impact of the scheduling policy
[Figure: Gflop/s vs. matrix order for the HEFT-TMDM-PR, HEFT-TMDM, HEFT-TM-PR, HEFT-TM and GREEDY policies.]
Name           Policy description
greedy         Greedy policy
heft-tm        HEFT based on Task duration Models (T_data transfer + T_computation)
heft-tm-pr     heft-tm with data PRefetch
heft-tmdp      heft-tm with remote Data Penalty (α T_data transfer + T_computation)
heft-tmdp-pr   heft-tmdp with data PRefetch
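
A minimal sketch of how such a cost model can drive the placement of one task: for each worker, estimate α T_data transfer + T_computation and pick the smallest value. Adding the time at which each worker becomes free is an assumption here, in the spirit of earliest-finish-time scheduling, and pick_worker is an illustrative name rather than a StarPU function. With alpha = 1 the estimate corresponds to the heft-tm formula, and with alpha > 1 to the heft-tmdp formula.

```c
#include <assert.h>

/* Return the index of the worker with the earliest estimated finish:
 *   finish[w] = available[w] + alpha * t_transfer[w] + t_compute[w]
 * available[w] is the time at which worker w becomes free (assumed),
 * t_transfer[w] the time to bring the task's data to w,
 * t_compute[w] the predicted kernel duration on w. */
int pick_worker(int nworkers, const double *available,
                const double *t_transfer, const double *t_compute,
                double alpha)
{
    int best = 0;
    double best_finish = 0.0;

    assert(nworkers > 0);
    for (int w = 0; w < nworkers; w++) {
        double finish = available[w] + alpha * t_transfer[w] + t_compute[w];
        if (w == 0 || finish < best_finish) {
            best = w;
            best_finish = finish;
        }
    }
    return best;
}
```

Raising alpha makes remote data look more expensive, so a task tends to stay on the unit that already holds its tiles even when another unit would compute it faster.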

39 Impact of the data penalty
Impact of the scheduling policy on the total amount of data transfers during sgeqrf, per matrix order (matrix orders not preserved):
heft-tm-pr     3.8 GB   57.2 GB   ... GB    ... GB
heft-tmdm-pr   1.9 GB   16.3 GB   25.4 GB   41.6 GB

40 Scalability
[Figure: Gflop/s vs. matrix order, scalability of sgeqrf and dgeqrf on Opteron-Tesla. Configurations, in single and double precision: 1 GPU + 1 CPU, 2 GPUs + 2 CPUs, 3 GPUs + 3 CPUs, 4 GPUs + 4 CPUs, 4 GPUs + 16 CPUs.]
Annotation: + 200 Gflop/s, but 12 cores = 150 Gflop/s (the gain from the 12 additional CPU cores exceeds what those cores deliver on their own).

42 Scalability
Kernel   CPU           GPU           Speedup
sgeqrt   9 Gflop/s     60 Gflop/s    6
stsqrt   12 Gflop/s    67 Gflop/s    6
sormqr   8.5 Gflop/s   227 Gflop/s   27
stsmqr   10 Gflop/s    285 Gflop/s   27
Task distribution observed on StarPU: sgeqrt: 20% of tasks on GPUs; stsmqr: 92.5% of tasks on GPUs.
Taking advantage of heterogeneity: only do what you are good at.

44 GPU kernels can be different
[Figure: Gflop/s vs. matrix order for (a) SGETRF (96, 2496) and (b) DGETRF (64, 2048), comparing the Mix, Mix-swp and Mix-trtri variants.]

48 Auto-Tuning: how to choose IB and NB?

51 How to choose IB and NB?
Run the experiments on a set of different (IB, NB) pairs:
- Only 1 GPU
- Matrix size of x * NB: 5120 on Tesla (10240 on Fermi)
- Record the average performance of the update kernel
- Record the performance of the global factorization
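
The experiment loop above amounts to an exhaustive search over candidate pairs. The sketch below assumes a caller-supplied measurement function that returns a performance figure (higher is better) for one (IB, NB) pair. The names tile_params, measure_fn, pick_best and toy_measure are hypothetical, and toy_measure is an arbitrary made-up formula used only so the sketch runs: it is not a performance model and reproduces no measured result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct tile_params {
    int ib;   /* inner blocking size */
    int nb;   /* tile size           */
};

/* Returns a performance figure for one pair; higher is better. */
typedef double (*measure_fn)(struct tile_params p);

/* Arbitrary stand-in for a real benchmark run (NOT measured data). */
double toy_measure(struct tile_params p)
{
    return (double)p.nb - 10.0 * (double)abs(p.ib - 64);
}

/* Try every candidate once and return the index of the best one. */
size_t pick_best(const struct tile_params *cand, size_t n,
                 measure_fn measure)
{
    size_t best = 0;
    double best_perf = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        double perf = measure(cand[i]);
        if (i == 0 || perf > best_perf) {
            best = i;
            best_perf = perf;
        }
    }
    return best;
}
```

In practice measure would run the update kernel or the full factorization on one GPU and return the observed Gflop/s, and the search would be repeated per architecture and per precision.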

52 Choice of (IB, NB) for sgetrf (4 Tesla C1050)
[Figure: (a) average performance of the update and (b) average performance of the panel, in Gflop/s vs. NB, for IB = 128, 96, 64 and one further IB value; (c) SGETRF on 4 GPUs, Gflop/s vs. matrix order, for (IB, NB) = (64, 1984), (32, 1984) and (128, 1280).]

55 LU on 3 Fermis + 2 Intel hexacore
[Figure: Gflop/s vs. matrix order for (d) SGETRF (96, 2496) and (e) DGETRF (64, 2048), comparing all 3 GPUs, 2 GPUs, 1 GPU and 12 CPUs.]

56 Summary of the one-sided factorizations
1. Cholesky: 1.3 TFlops vs. 1.1 with static scheduling
2. QR: 1. TFlops vs. 0.8 with static scheduling
3. LU: 1.1 TFlops vs. 0.9 with static scheduling
StarPU adds the performance of the CPUs. The improvement due to the CPUs is higher than the CPUs' theoretical performance.

58 Conclusion
- Fermi gives really good results in double precision
- The problem is the huge performance difference between CPU and GPU
- Numerical stability issues due to pairwise pivoting in LU
Provides:
- Cholesky, QR, CAQR and LU factorizations and solves
- A subset of the BLAS-3 subroutines
- Cholesky inversion
Future work:
- Move to two-sided factorizations (eigenvalue and singular value problems)
- Move to distributed memory

59 Useful Links PLASMA = MAGMA = MORSE = StarPU =

60 Thank you!
