Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign

Size: px

Start display at page:

Download "Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign"

Joleen Watkins
5 years ago
Views:

1 Hydra: Generation and Tuning of parallel solutions for linear algebra equations Alexandre X. Duchâteau University of Illinois at Urbana Champaign

2 Collaborators Thesis Advisors Denis Barthou (Labri/INRIA Bordeaux) David Padua (University of Illinois at UC) Future collaboration starpu team, INRIA Bordeaux Jointlab INRIA-UIUC 2

3 Objectives and Contributions INTRODUCTION Jointlab INRIA-UIUC 3

4 Objectives Compile linear algebra equations Compute X for L X X U = C [CTSY] Compute L and U for L U = A [LU] Generate efficient task parallel code Identify tasks Generate task dependence graph Jointlab INRIA-UIUC 4

5 Motivation Focus is on linear algebra Start from high level description No code or algorithm Derivation through blocking of operands Data centric approach Derivation for parallelism Output is a parallel task graph Jointlab INRIA-UIUC 5

6 Contributions A specification language Express computation Characterize operands (shapes) Identify wanted result Derivation rules Validity/applicability patterns Operators symbolic execution rules Dependence build engine Jointlab INRIA-UIUC 6

7 A detailed view of the generator SYSTEM DESCRIPTION Jointlab INRIA-UIUC 7

8 Description Language - Operands %% Operands X: Unknown Matrix L: Lower Triangular Square Matrix U: Upper Triangular Square Matrix C: Square Matrix %% Equation L*X-X*U=C %% ctsy All operands (Type inference) Status Known, Unknown Shape Triangular, diagonal Type Matrix, (vector, scalar) Modifiers (transpose) (Sizes) (Density) Jointlab INRIA-UIUC 8

9 Description language - Equation %% Operands X: Unknown Matrix L: Lower Triangular Square Matrix U: Upper Triangular Square Matrix C: Square Matrix %% Equation L*X-X*U=C %% ctsy Base for decomposition Simple equations Assignments X = A*B Solvers LU Triangular Sylvester L*X=B Cholesky Jointlab INRIA-UIUC 9

Description language - Parameters %% Operands X: Unknown Matrix L: Lower Triangular Square Matrix U: Upper Triangular Square Matrix C: Square Matrix %% Equation

10 Description language - Parameters %% Operands X: Unknown Matrix L: Lower Triangular Square Matrix U: Upper Triangular Square Matrix C: Square Matrix %% Equation L*X-X*U=C %% ctsy Drive code generation Set of parameters Name Codelet Order of operands Data type For customization Default values used Jointlab INRIA-UIUC 10

11 Kernel Declaration Generate a full solution Cost of full recursion Usually not a good idea Use existing kernels and libraries Already optimized Jointlab INRIA-UIUC 11

12 Performance Kernel Performance Measure performance Depend on size target Guide exploration Optimal nodes Similar to ATLAS Problem Size Jointlab INRIA-UIUC 12

13 Starpu Codelet Codelet declaration void gemm(void *buffers[], void *cl_arg); struct starpu_codelet gemm_cl = {.where = STARPU_CPU,.cpu_funcs = { gemm, NULL},.nbuffers = 3,.modes = {STARPU_R, STARPU_R, STARPU_RW} }; Kernel code void gemm(void *buffers[], void *cl_arg) { struct params *params = cl_arg; int n = params->n; double *a = (double *) STARPU_MATRIX_GET_PTR(buffers[0]); // cblas_dgemm(cblasrowmajor, CBlasNoTrans, CBlasNoTrans, n, n, n, 1.0, a, n, b, n, 1.0, c, n); } Jointlab INRIA-UIUC 13

14 Defining blocking space Blocking defines the space of solutions Must only generate valid solutions Look at the equation s operation tree Each node gets a set of dimensions Generate constraints depending on operation Jointlab INRIA-UIUC 14

15 Validity of Blocking y A x B y B B <x A,y B > A B <x A,y A > <x B,y B > x A A New Constraint { y A = x B } Jointlab INRIA-UIUC 15

16 Valid Blockings - DTSY A <x A,y A > <x A,y X > <x A,y B > X <x X,y X > - (C) <x C,y C > B <x B,y B > X <x X,y X > Constraints y A = x X y X = x B x A = x X y B = y X x C = x X y C = y X { x A,y A,x X, x C } { x B,y B,y X,y C } Jointlab INRIA-UIUC 16

17 Valid Blockings DTSY (2) A*X*B X = C A lower triangular, B upper triangular xa = ya = xx = 2 xb = yb = yx = 2 A(0,0) A(0,1) X(0,0) X(0,1) B(0,0) B(0,1) X(0,0) X(0,1) T(0,0) T(0,1) * * - = A(1,0) A(1,1) X(1,0) X(1,1) B(1,0) B(1,1) X(1,0) X(1,1) T(1,0) T(1,1) xa = ya = xx = 2 xb = yb = yx = 1 A(0,0) A(0,1) X(0,0) * * A(1,0) A(1,1) X(1,0) B(0,0) <=> B X(0,0) - = X(1,0) T(0,0) T(1,0) xa = ya = xx = 3 xb = yb = yx = 3 A(0,0) A(2,2) X(0,0) X(2,2) B(0,0) B(2,2) X(0,0) X(2,2) * * - = C(0,0) C(2,2) Jointlab INRIA-UIUC 17

Derivation example 0. 0 0 0 0 0. * = A X T blocking Sym. Exec.

A(1,1)*X(1,0) T(1,1) = A(1,0)*X(0,1) + A(1,1)*X(1,1) Removal of 0-computation T(0,0)

18 Derivation example * = A X T blocking Sym. Exec. A(0,0) A(0,1) X(0,0) X(0,1) T(0,0) T(0,1) * = A(1,0) A(1,1) X(1,0) X(1,1) T(1,0) T(1,1) T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0) T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1) T(1,0) = A(1,0)*X(0,0) + A(1,1)*X(1,0) T(1,1) = A(1,0)*X(0,1) + A(1,1)*X(1,1) Removal of 0-computation T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0) T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1) T(1,0) = A(1,1)*X(1,0) T(1,1) = A(1,1)*X(1,1) Jointlab INRIA-UIUC 18

19 Operand characterization Blocks of X are outputs (unknown) A(0,0) and A(1,1) are lower triangular Equation Input Output T(1,1) = A(1,1) X(1,1) T(1,1) A(1,1) X(1,1) T(1,0) = A(1,1) X(1,0) T(1,0) A(1,1) X(1,0) T(0,1) = A(0,0) X(0,1) + A(0,1) X(1,1) T(0,1) A(0,0) A(0,1) X(0,1) X(1,1) T(0,0) = A(0,0) X(0,0) + A(0,1) X(1,0) T(0,0) A(0,0) A(0,1) X(0,0) X(1,0) Jointlab INRIA-UIUC 19

20 Equation Signature Need to identify equations Identification through types and operators A X = T : LT UNK = MT Set of simplification rules UNK + MT MT UNK + MT Jointlab INRIA-UIUC 20

21 Identify task T(1,1) = A(1,1) X(1,1) Signature : LT UNK = MT Instance of original problem Solvable X(1,1) can now be considered known Jointlab INRIA-UIUC 21

22 Building dependence Find instances of X(1,1) Shift in input set Add dependence edge Equation Input Output T(0,1) = A(0,0) X(0,1) + A(0,1) X(1,1) T(0,1) A(0,0) A(0,1) X(1, 1) X(0,1) New Dependence {T 1,1 = A 1,1 X 1,1 T 0,1 = A 0,0 X 0,1 + A 0,1 X(1,1) } Jointlab INRIA-UIUC 22

23 Signature Simplification T 0,1 = A 0,0 X 0,1 + A 0,1 X 1,1 1. MT = LT UNK + MT MT 2. MT = LT UNK + MT 3. MT MT = LT UNK 4. MT = LT UNK Match to the original problem! Jointlab INRIA-UIUC 23

24 Post identification expansion Simplification step Corresponding equation MT = LT UNK + MT MT T 0,1 = A 0,0 X 0,1 + A 0,1 X 1,1 MT = LT UNK + MT MT MT = LT UNK MT = LT UNK T1 = A 01 X(1,1) T 0,1 = A 0,0 X 0,1 + T1 T1 = A 01 X(1, 1) T2 = T 0, 1 T1 T2 = A 0, 0 X(0, 1) Jointlab INRIA-UIUC 24

Selection Generator Equation Dependence Graph Derivation L X U X = C ø L11 X11 U11 X11 = T 1 2 3 4 1 1 2 3 2 3 4 4 4 Set Equations Dependence Graph Identification 1: L00 X00 U00 X00 = C00 2: L10 X00

25 Selection Generator Equation Dependence Graph Derivation L X U X = C ø L11 X11 U11 X11 = T Set Equations Dependence Graph Identification 1: L00 X00 U00 X00 = C00 2: L10 X00 U00 + L11 X10 U00 X10 = C10 3: L00 X00 U01 + L00 X01 U11 X01 = C01 4: L10 X00 U L11 X11 U11 X11 = C11 ø 1: L X U X = T00 2: L X U L X U X = T10 3: L X U L X U X = T01 4: L X U L X U X = T Dependence Graph Termination Final Graph Jointlab INRIA-UIUC 25

26 Simple graph example Jointlab INRIA-UIUC 26

27 Graph Example Jointlab INRIA-UIUC 27

28 Heterogeneous Graph Example MM DTSY DTSY Jointlab INRIA-UIUC 28

29 INITIAL RESULTS AND CHALLENGES Jointlab INRIA-UIUC 29

30 Challenges Version generation time 5 minutes in generator 20 minutes compiling Generated code size 500k lines of code in single function (icc segfaults) Jointlab INRIA-UIUC 30

31 Speedup Trisolve (L*X=B) ref Problem Size Jointlab INRIA-UIUC 31

32 Conclusion Faster development cycle for architectures No time to hand-tune everything anymore Can t hand tune for every HW iteration Increased complexity Heterogeneous systems Description of an automatic methodology Leverage existing small scale libraries Jointlab INRIA-UIUC 32

33 AU CAS OU Jointlab INRIA-UIUC 33

34 Exploiting heterogeneous machines : starpu Runtime Goal: Scheduling ( offloading) tasks over heterogeneous machines CPU + GPU + SPU = *PU Auto-tuning of performance models Optimization of memory transfers Target for Compilers StarSs [UPC], HMPP [CAPS] Libraries PLASMA/MAGMA [UTK] Applications Fast multipole methods [CESTA] StarPU provides an Open Scheduling platform Scheduling algorithm = plugins CPU CPU CPU CPU CPU CPU M. CPU CPU M. GPU GPU M. M Speed (GFlops) Jointlab INRIA-UIUC 34 Greedy task model prefetch data model

35 Overview of StarPU Maximizing PU occupancy, minimizing data transfers Principle Accept tasks that may have multiple implementations Together with potential interdependencies Leads to a dynamic acyclic graph of tasks Data-flow approach Provide a high-level data management layer Application should only describe Which data may be accessed by tasks How data may be divided Parallel Compilers CPU HPC Applications StarPU Parallel Libraries Drivers (CUDA, OpenCL GPU Jointlab INRIA-UIUC 35

36 Code styles and optimization for(int i = 0 ; i < N ; i++) for(int j = 0 ; j < N ; j++) for(int k = 0 ; k < N ; k++) c[i][j] += a[i][k] * b[k][j]; for(int kk = 0 ; kk < N ; kk+=b) for(int ii = 0 ; ii < N ; ii+=b) for(int jj = 0 ; jj < N ; jj+=b) for(int i = ii ; i < ii+b ; i++) for(int k = kk ; k < kk+b ; k++) { c_i = c[i]; a_ik = a[i][k]; b_k = b[k]; for(int j = jj ; i < jj+b ; i+=2) { c_i[j] += a_ik * b_k[j]; c_u[j+1] += a_ik * b_k[j+1]; } } Alexandre X. Duchateau - Preliminary Examination 36

37 Empirical search Generation of multiple solutions Algorithmic exploration Theoretical evaluation Usually unreliable Filtering heuristic Empirical evaluation (execution) slow (very slow when looking at bad versions) Alexandre X. Duchateau - Preliminary Examination 37

38 Generalized system overview Driver Description Generator Versions Prediction Execution Selection Predictor Filters what versions are actually executed Driver Guide the search Alexandre X. Duchateau - Preliminary Examination 38

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric