Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau, University of Illinois at Urbana-Champaign


Collaborators

Thesis advisors: Denis Barthou (LaBRI/INRIA Bordeaux) and David Padua (University of Illinois at Urbana-Champaign). Future collaboration: the StarPU team, INRIA Bordeaux.

Introduction: Objectives and Contributions

Objectives

Compile linear algebra equations: compute X for L*X - X*U = C [CTSY]; compute L and U for L*U = A [LU]. Generate efficient task-parallel code: identify tasks and generate the task dependence graph.

Motivation

The focus is on linear algebra. Start from a high-level description: no code or algorithm is given. Derivation proceeds through blocking of the operands, a data-centric approach, and the derivation itself exposes parallelism. The output is a parallel task graph.

Contributions

A specification language to express the computation, characterize the operands (shapes), and identify the wanted result. Derivation rules: validity/applicability patterns and symbolic-execution rules for the operators. A dependence-building engine.

System Description: a detailed view of the generator

Description Language - Operands

%% Operands
X: Unknown Matrix
L: Lower Triangular Square Matrix
U: Upper Triangular Square Matrix
C: Square Matrix
%% Equation
L*X-X*U=C
%% Parameters
@name ctsy

All operands are declared, with type inference: a status (known, unknown), a shape (triangular, diagonal), a type (matrix; also vector, scalar), and modifiers (transpose, sizes, density).

Description Language - Equation

The %% Equation section (here L*X-X*U=C) is the base for decomposition. Simple equations are either assignments (X = A*B) or solver invocations: LU, triangular solve (L*X=B), Sylvester, Cholesky.

Description Language - Parameters

The %% Parameters section drives code generation: name, codelet, order of operands, data type. Parameters exist for customization; default values are used when they are omitted.

Kernel Declaration

Generating a full solution would pay the cost of full recursion, which is usually not a good idea. Instead, reuse existing kernels and libraries, which are already optimized.

Kernel Performance

Measure kernel performance; it depends on the target size. The measurements guide the exploration toward optimal nodes, similarly to ATLAS. [Plot: kernel performance as a function of problem size, for sizes 2 through 128.]

StarPU Codelet

Codelet declaration:

void gemm(void *buffers[], void *cl_arg);

struct starpu_codelet gemm_cl = {
    .where     = STARPU_CPU,
    .cpu_funcs = { gemm, NULL },
    .nbuffers  = 3,
    .modes     = { STARPU_R, STARPU_R, STARPU_RW }
};

Kernel code:

void gemm(void *buffers[], void *cl_arg)
{
    struct params *params = cl_arg;
    int n = params->n;
    double *a = (double *) STARPU_MATRIX_GET_PTR(buffers[0]);
    double *b = (double *) STARPU_MATRIX_GET_PTR(buffers[1]);
    double *c = (double *) STARPU_MATRIX_GET_PTR(buffers[2]);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 1.0, c, n);
}
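For context, here is a hedged usage sketch of how such a codelet is typically driven (my own illustration, not from the slides; exact calls vary across StarPU versions). The matrices are registered as data handles and a task is inserted with the declared access modes:

/* Hypothetical driver for the gemm codelet above; a, b, c, n and
   struct params are taken from the slide, the rest is standard StarPU usage. */
starpu_data_handle_t ha, hb, hc;
struct params p = { .n = n };

starpu_matrix_data_register(&ha, STARPU_MAIN_RAM, (uintptr_t) a, n, n, n, sizeof(double));
starpu_matrix_data_register(&hb, STARPU_MAIN_RAM, (uintptr_t) b, n, n, n, sizeof(double));
starpu_matrix_data_register(&hc, STARPU_MAIN_RAM, (uintptr_t) c, n, n, n, sizeof(double));

starpu_task_insert(&gemm_cl,
                   STARPU_R,  ha,
                   STARPU_R,  hb,
                   STARPU_RW, hc,
                   STARPU_VALUE, &p, sizeof(p),
                   0);

starpu_task_wait_for_all();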

Defining the Blocking Space

Blocking defines the space of solutions, and only valid solutions must be generated. Look at the equation's operation tree: each node gets a set of dimensions, and constraints are generated depending on the operation.

Validity of Blocking

For a product A*B, A carries dimensions <x_A, y_A> and B carries <x_B, y_B>; the blocking is valid only if the inner dimensions agree, which adds the new constraint { y_A = x_B }.

Valid Blockings - DTSY

For A*X*B - X = C, with A blocked <x_A,y_A>, X blocked <x_X,y_X>, B blocked <x_B,y_B>, and C blocked <x_C,y_C>, the constraints are:

y_A = x_X, y_X = x_B, x_A = x_X, y_B = y_X, x_C = x_X, y_C = y_X

They collapse the dimension variables into two classes, { x_A, y_A, x_X, x_C } and { x_B, y_B, y_X, y_C }, so a valid blocking is determined by two independent block counts (a sketch of this computation follows below).
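To make the constraint solving concrete, here is a minimal self-contained sketch (my own illustration, not Hydra's implementation) that recovers the two classes by union-find over the dimension variables:

#include <stdio.h>

enum { xA, yA, xX, yX, xB, yB, xC, yC, NDIMS };
static int parent[NDIMS];

static int find(int d) { return parent[d] == d ? d : (parent[d] = find(parent[d])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    for (int d = 0; d < NDIMS; d++) parent[d] = d;
    /* The constraints listed on the slide for A*X*B - X = C: */
    unite(yA, xX); unite(yX, xB); unite(xA, xX);
    unite(yB, yX); unite(xC, xX); unite(yC, yX);
    /* Two classes remain, {xA,yA,xX,xC} and {xB,yB,yX,yC}, so a valid
       blocking is described by two independent block counts. */
    printf("xA and xB in same class? %s\n", find(xA) == find(xB) ? "yes" : "no");
    return 0;
}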

Valid Blockings - DTSY (2)

For A*X*B - X = C with A lower triangular and B upper triangular, valid blockings include:

x_A = y_A = x_X = 2 and x_B = y_B = y_X = 2 (both classes split 2x2)
x_A = y_A = x_X = 2 and x_B = y_B = y_X = 1 (B kept as a single block)
x_A = y_A = x_X = 3 and x_B = y_B = y_X = 3

[Figure: the corresponding block pictures of A, X, B and the temporaries T, with structurally zero blocks marked.]

Derivation Example

Start from A*X = T with A triangular. Blocking the operands 2x2 and symbolically executing the product gives:

T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)
T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
T(1,0) = A(1,0)*X(0,0) + A(1,1)*X(1,0)
T(1,1) = A(1,0)*X(0,1) + A(1,1)*X(1,1)

Because A is triangular, the block A(1,0) is structurally zero; removal of 0-computation leaves:

T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)
T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
T(1,0) = A(1,1)*X(1,0)
T(1,1) = A(1,1)*X(1,1)
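A minimal sketch of the removal step, under the shape lattice the slides imply (the names are mine, not Hydra's): during symbolic execution, a product term is dropped as soon as one factor is a structurally zero block.

#include <stdio.h>

enum shape { ZERO, LT, UT, MT, UNK };   /* block shapes tracked symbolically */

/* A product vanishes when either factor is a structurally zero block. */
static enum shape mul_shape(enum shape a, enum shape b) {
    return (a == ZERO || b == ZERO) ? ZERO : MT;
}

int main(void) {
    enum shape a10 = ZERO;   /* the zero off-diagonal block of triangular A */
    enum shape x01 = UNK;
    if (mul_shape(a10, x01) == ZERO)
        printf("drop A(1,0)*X(0,1), leaving T(1,1) = A(1,1)*X(1,1)\n");
    return 0;
}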

Operand Characterization

Blocks of X are outputs (unknown); A(0,0) and A(1,1) are lower triangular.

Equation                                  | Inputs                  | Outputs
T(1,1) = A(1,1)*X(1,1)                    | T(1,1), A(1,1)          | X(1,1)
T(1,0) = A(1,1)*X(1,0)                    | T(1,0), A(1,1)          | X(1,0)
T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)    | T(0,1), A(0,0), A(0,1)  | X(0,1), X(1,1)
T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)    | T(0,0), A(0,0), A(0,1)  | X(0,0), X(1,0)

Equation Signature

Equations must be identified, and identification goes through operand types and operators: A*X = T has signature LT * UNK = MT. A set of simplification rules reduces signatures, e.g. UNK + MT*MT -> UNK + MT (a product of known matrices is itself just a known matrix).

Task Identification

T(1,1) = A(1,1)*X(1,1) has signature LT * UNK = MT: it is an instance of the original problem and therefore solvable, so X(1,1) can now be considered known.

Building Dependences

Find the remaining occurrences of X(1,1): each shifts from the output set to the input set, and a dependence edge is added. For T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1), the inputs become T(0,1), A(0,0), A(0,1), X(1,1) and the output is X(0,1), giving the new dependence

{ T(1,1) = A(1,1)*X(1,1) } -> { T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1) }

(a sketch of this scan follows below).
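The scan can be pictured as follows (a sketch with hypothetical data structures, not the actual engine): compare the block each equation produces against the blocks every other equation reads, and emit an edge on a match.

#include <stdio.h>
#include <string.h>

#define MAXIN 4

/* Hypothetical representation of one block equation. */
struct eqn {
    const char *out;         /* the unknown block this equation solves */
    const char *in[MAXIN];   /* the blocks it reads (NULL-padded)      */
};

int main(void) {
    struct eqn eqs[] = {
        { "X11", { "T11", "A11" } },
        { "X01", { "T01", "A00", "A01", "X11" } },  /* X11 moved to inputs */
    };
    int n = (int)(sizeof eqs / sizeof eqs[0]);

    /* Edge i -> j whenever equation j reads the block equation i produces. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < MAXIN && eqs[j].in[k]; k++)
                if (strcmp(eqs[j].in[k], eqs[i].out) == 0)
                    printf("dependence: eq%d -> eq%d (via %s)\n", i, j, eqs[i].out);
    return 0;
}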

Signature Simplification

For T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1):

1. MT = LT*UNK + MT*MT
2. MT = LT*UNK + MT
3. MT - MT = LT*UNK
4. MT = LT*UNK

A match to the original problem!
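Purely as an illustration (assuming string signatures like those above; Hydra's internal representation is not shown on the slides), the two rewrites can be mechanized:

#include <stdio.h>
#include <string.h>

int main(void) {
    char sig[64] = "MT=LT*UNK+MT*MT";   /* signature of step 1 */

    /* Rule: a product of known matrices is a known matrix (step 1 -> 2). */
    char *p = strstr(sig, "MT*MT");
    if (p) memmove(p + 2, p + 5, strlen(p + 5) + 1);   /* "MT=LT*UNK+MT" */

    /* Moving the known term to the left gives MT-MT = LT*UNK, and the
       difference of known matrices is again just MT (steps 3 and 4). */
    p = strstr(sig, "+MT");
    if (p) *p = '\0';                                  /* "MT=LT*UNK"   */

    printf("%s\n", sig);   /* the original triangular-solve signature */
    return 0;
}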

Post-Identification Expansion

Each simplification step has a corresponding equation rewrite:

MT = LT*UNK + MT*MT  |  T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
MT = LT*UNK + MT     |  T1 = A(0,1)*X(1,1);  T(0,1) = A(0,0)*X(0,1) + T1
MT - MT = LT*UNK     |  T2 = T(0,1) - T1
MT = LT*UNK          |  T2 = A(0,0)*X(0,1)

Selection / Generation Pipeline

Equation -> Derivation -> Set of equations -> Identification -> Dependence graph -> Termination -> Final graph.

For L*X*U - X = C, derivation with a 2x2 blocking produces the set of equations:

1: L00*X00*U00 - X00 = C00
2: L10*X00*U00 + L11*X10*U00 - X10 = C10
3: L00*X00*U01 + L00*X01*U11 - X01 = C01
4: L10*X00*U01 + ... + L11*X11*U11 - X11 = C11

Identification recognizes each equation as an instance of the original problem, and the dependence graph links them: equation 1 produces X00, which equations 2, 3 and 4 consume; equation 4 additionally consumes X10 and X01. Each identified subproblem (e.g. L11*X11*U11 - X11 = T) is itself derived recursively over its sub-blocks until kernel sizes are reached, yielding the final graph.

Simple Graph Example

[Figure: a small generated task dependence graph.]

Graph Example

[Figure: a larger generated task dependence graph.]

Heterogeneous Graph Example

[Figure: a task graph mixing MM and DTSY tasks.]

Initial Results and Challenges

Challenges

Version generation time: 5 minutes in the generator, 20 minutes compiling. Generated code size: 500k lines of code in a single function (icc segfaults).

Speedup - Trisolve (L*X = B)

[Plot: speedup over the reference version for 2, 4, 8 and 10 blocks, on problem sizes from 1000 to 18000; the y-axis ranges from 0 to 7.]

Conclusion

Development cycles must keep up with architectures: there is no time to hand-tune everything anymore, hand-tuning cannot be repeated for every hardware iteration, and complexity keeps increasing (heterogeneous systems). This work described an automatic methodology that leverages existing small-scale optimized libraries.

Backup Slides (Just in Case)

Exploiting Heterogeneous Machines: the StarPU Runtime

Goal: scheduling (and offloading) tasks over heterogeneous machines (CPU + GPU + SPU = *PU), with auto-tuning of performance models and optimization of memory transfers. StarPU is a target for compilers (StarSs [UPC], HMPP [CAPS]), libraries (PLASMA/MAGMA [UTK]), and applications (fast multipole methods [CESTA]). It provides an open scheduling platform: scheduling algorithms are plugins. [Figure: a machine with several CPUs and GPUs, each with its own memory; plot of speed (GFlops) for greedy, task-model, and data-prefetch scheduling policies.]

Overview of StarPU

Maximize PU occupancy while minimizing data transfers. Principle: accept tasks that may have multiple implementations, together with potential interdependencies; this leads to a dynamic acyclic graph of tasks, i.e. a data-flow approach. StarPU provides a high-level data-management layer: the application only describes which data may be accessed by tasks and how the data may be divided. [Diagram: HPC applications, parallel compilers and parallel libraries sit on top of StarPU, whose drivers (CUDA, OpenCL) target CPUs and GPUs.]

Code Styles and Optimization

Naive triple loop:

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];

Blocked and unrolled version (block size b):

for (int kk = 0; kk < N; kk += b)
    for (int ii = 0; ii < N; ii += b)
        for (int jj = 0; jj < N; jj += b)
            for (int i = ii; i < ii + b; i++)
                for (int k = kk; k < kk + b; k++) {
                    double *c_i  = c[i];
                    double  a_ik = a[i][k];
                    double *b_k  = b[k];
                    for (int j = jj; j < jj + b; j += 2) {
                        c_i[j]     += a_ik * b_k[j];
                        c_i[j + 1] += a_ik * b_k[j + 1];
                    }
                }

Empirical Search

Generate multiple solutions by algorithmic exploration. Theoretical evaluation is usually unreliable, so it serves only as a filtering heuristic; empirical evaluation (actual execution) is slow, and very slow when bad versions have to be run.

Generalized System Overview

Driver -> Description -> Generator -> Versions -> Prediction -> Execution -> Selection. The predictor filters which versions are actually executed; the driver guides the search. A sketch of this loop follows.
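A minimal sketch of that loop (all names and the stub models are hypothetical; the real system generates blocked solver versions rather than integers):

#include <float.h>
#include <stdio.h>

typedef struct { int blocking; } version_t;

/* Cheap, unreliable model estimate: stands in for the predictor. */
static double predict(const version_t *v) { return 1.0 / v->blocking; }

/* Slow empirical measurement: stands in for actually running the version. */
static double execute(const version_t *v) { return 1.0 / v->blocking + 0.01 * (v->blocking % 3); }

int main(void) {
    version_t best = { 0 };
    double best_t = DBL_MAX;
    for (int b = 2; b <= 10; b += 2) {        /* generator: candidate versions  */
        version_t v = { b };
        if (predict(&v) > 2.0 * best_t)       /* predictor filters clear losers */
            continue;
        double t = execute(&v);               /* empirical evaluation           */
        if (t < best_t) { best_t = t; best = v; }   /* selection                */
    }
    printf("selected blocking %d (%.3f s)\n", best.blocking, best_t);
    return 0;
}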