
1 Advanced Restructuring Compilers Advanced Topics Spring 2009 Prof. Robert van Engelen

2 Overview
Data and control dependences
The theory and practice of data dependence analysis
K-level loop-carried dependences
Loop-independent dependences
Testing dependence
Dependence equations and the polyhedral model
Dependence solvers
Loop transformations
Loop parallelization
Loop reordering
Unimodular loop transformations

3 Data and Control Dependences
A fundamental property of all computations:
S1  PI = 3.14
S2  R = 5.0
S3  AREA = PI * R ** 2
Data dependence: data is produced and consumed in the correct order
S1  IF (T.EQ.0.0) GOTO S3
S2  A = A / T
S3  CONTINUE
Control dependence: a dependence that arises as a result of control flow

4 Data Dependence Definition
There is a data dependence from statement S1 to statement S2 iff
1) both statements S1 and S2 access the same memory location and at least one of them stores into it, and
2) there is a feasible run-time execution path from statement S1 to S2

5 Data Dependence Classification
S1  X = ...
S2  ... = X
A true dependence (or flow dependence), denoted S1 δ S2

S1  ... = X
S2  X = ...
An anti dependence, denoted S1 δ-1 S2

S1  X = ...
S2  X = ...
An output dependence, denoted S1 δo S2

6 Compiling for Parallelism
Theorem: Any reordering transformation that preserves every dependence in a program preserves the meaning of that program
Bernstein's conditions for loop parallelism:
Iteration I1 does not write into a location that is read by iteration I2
Iteration I2 does not write into a location that is read by iteration I1
Iteration I1 does not write into a location that is written into by iteration I2

7 Fortran's PARALLEL DO
The PARALLEL DO requires that there are no scheduling constraints among its iterations
When is it safe to convert a DO to a PARALLEL DO?

PARALLEL DO I = 1, N
  A(I+1) = A(I) + B(I)      (not OK: cross-iteration flow dependence)

PARALLEL DO I = 1, N
  A(I) = A(I) + B(I)        (OK)

PARALLEL DO I = 1, N
  A(I-1) = A(I) + B(I)      (not OK: cross-iteration anti dependence)

PARALLEL DO I = 1, N
  S = S + B(I)              (not OK: reduction on S)

8 Dependences
Whereas the PARALLEL DO requires the programmer to check whether loop iterations can be executed in parallel, an optimizing compiler must analyze loop dependences for automatic loop parallelization and vectorization
There are multiple levels of granularity of parallelism a programmer or compiler can exploit, for example:
Instruction-level parallelism (ILP)
Synchronous (e.g. SIMD, also vector and SSE) loop parallelism (fine grain): the aim is to parallelize inner loops as vector ops
Asynchronous (e.g. MIMD, work sharing) loop parallelism (coarse grain): the aim is to schedule outer loops as tasks
Task parallelism (coarse grain): the aim is to run different tasks (threads) in parallel

9 Advanced Compiler Technology
Powerful restructuring compilers transform loops into parallel or vector code
They must analyze data dependences in loop nests besides straight-line code
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K)*B(K,I)
parallelize:
PARALLEL DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K)*B(K,I)
vectorize:
DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K)*B(K,I)

10 Synchronous Parallel Machines
[Figure: an example 4x4 processor mesh]
SIMD architecture: processors execute in lock-step, issuing the same instruction every clock
SIMD architecture: simpler (and cheaper) than MIMD
Programming model: PRAM (parallel random access machine)
Processors operate in lockstep, executing the same instruction on different portions of the data space
The application should exhibit data parallelism
Control flow requires masking operations (similar to predicated instructions), which can be inefficient

11 Asynchronous Parallel Computers
[Figures: a shared-memory machine (PEs p1..p4 on a bus with a shared memory) and a distributed-memory machine (PEs p1..p4 with local memories m1..m4 connected by a bus, crossbar, or network)]
Multiprocessors: shared memory is directly accessible by each PE (processing element)
Multicomputers: distributed memory requires message passing between PEs
SMPs: shared memory within small clusters of processors; message passing is needed across clusters

12 Valid Transformations
A transformation is said to be valid for the program to which it applies if it preserves all dependences in the program

Original:
DO I = 1, 3
S1  A(I+1) = B(I)
S2  B(I+1) = A(I)

Valid (statements reordered; both loop-carried dependences are preserved):
DO I = 1, 3
  B(I+1) = A(I)
  A(I+1) = B(I)

Invalid (loop reversed; the level-1 dependences are violated):
DO I = 3, 1, -1
  A(I+1) = B(I)
  B(I+1) = A(I)

13 Fundamental Theorem of Dependence
Any reordering transformation that preserves every dependence in a program preserves the meaning of that program

14 Dependence in Loops
DO I = 1, N
S1  A(I+1) = A(I) + B(I)
The statement instances S1[i] for iterations i = 1, ..., N represent the loop execution
S1[1]  A(2) = A(1) + B(1)
S1[2]  A(3) = A(2) + B(2)
S1[3]  A(4) = A(3) + B(3)
S1[4]  A(5) = A(4) + B(4)
Statement instances with flow dependences:
S1[1] δ S1[2]
S1[2] δ S1[3]
...
S1[N-1] δ S1[N]

15 Iteration Vector
Definition: Given a nest of n loops, the iteration vector i of a particular iteration of the innermost loop is a vector of integers i = (i1, i2, ..., in), where ik represents the iteration number for the loop at nesting level k
The set of all possible iteration vectors spans an iteration space over the loop statement instances Sj[i]

16 Iteration Vector Example
DO I = 1, 3
  DO J = 1, I
S1    A(I,J) = A(J,I)
The iteration space of the statement at S1 is the set of iteration vectors {(1,1), (2,1), (2,2), (3,1), (3,2), (3,3)}
Example: at iteration i = (2,1), statement instance S1[i] assigns the value of A(1,2) to A(2,1)
[Figure: the iteration space plotted in the (I,J) plane]

17 Iteration Vector Ordering
The iteration vectors are naturally ordered according to a lexicographic order
For example, iteration (1,2) precedes (2,1) and (2,2)
Definition: Iteration i precedes iteration j, denoted i < j, iff
1) i[1:n-1] < j[1:n-1], or
2) i[1:n-1] = j[1:n-1] and in < jn
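
As a concrete illustration (my own sketch, not from the slides), this lexicographic test in Fortran, assuming iteration vectors are stored as integer arrays of length N:

  ! Hypothetical helper: .TRUE. iff iteration vector IV lexicographically
  ! precedes iteration vector JV (both of length N)
  LOGICAL FUNCTION LEXPREC(IV, JV, N)
    INTEGER, INTENT(IN) :: N, IV(N), JV(N)
    INTEGER :: K
    LEXPREC = .FALSE.
    DO K = 1, N
      IF (IV(K) < JV(K)) THEN      ! first differing component decides
        LEXPREC = .TRUE.
        RETURN
      ELSE IF (IV(K) > JV(K)) THEN
        RETURN
      END IF
    END DO                          ! all components equal: not strictly before
  END FUNCTION LEXPREC

For the slide's example, LEXPREC((/1,2/), (/2,1/), 2) returns .TRUE.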

18 Cross-Iteration Dependence
Definition: There exists a dependence from S1 to S2 in a loop nest iff there exist two iteration vectors i and j such that
1) i < j and there is a path from S1 to S2,
2) S1 accesses memory location M on iteration i and S2 accesses memory location M on iteration j, and
3) one of these accesses is a write

19 Dependence Example
DO I = 1, 3
  DO J = 1, I
S1    A(I,J) = A(J,I)
Show that the loop has no cross-iteration dependence
Answer: there are no iteration vectors i and j in {(1,1), (2,1), (2,2), (3,1), (3,2), (3,3)} such that i < j and either S1 in iteration i writes to the same element of A that is read at S1 in iteration j, or S1 in iteration i reads an element of A that is written at S1 in iteration j

20 Dependence Distance and Direction Vectors
Definition: Given a dependence from S1 on iteration i to S2 on iteration j, the dependence distance vector d(i, j) is defined as d(i, j) = j - i
Definition: Given a dependence from S1 on iteration i and S2 on iteration j, the dependence direction vector D(i, j) is defined for the kth component as
D(i, j)k = <  if d(i, j)k > 0
D(i, j)k = =  if d(i, j)k = 0
D(i, j)k = >  if d(i, j)k < 0
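
A minimal Fortran sketch of these two definitions (my own illustration, not from the slides), assuming iteration vectors as integer arrays:

  ! Hypothetical sketch: distance vector DVEC and direction vector DIRV
  ! for a dependence from iteration IV to iteration JV
  SUBROUTINE DEPVEC(IV, JV, N, DVEC, DIRV)
    INTEGER, INTENT(IN)    :: N, IV(N), JV(N)
    INTEGER, INTENT(OUT)   :: DVEC(N)
    CHARACTER, INTENT(OUT) :: DIRV(N)
    INTEGER :: K
    DO K = 1, N
      DVEC(K) = JV(K) - IV(K)       ! d(i,j) = j - i
      IF (DVEC(K) > 0) THEN
        DIRV(K) = '<'
      ELSE IF (DVEC(K) == 0) THEN
        DIRV(K) = '='
      ELSE
        DIRV(K) = '>'
      END IF
    END DO
  END SUBROUTINE DEPVEC

For i = (1,1) and j = (2,1) this yields d = (1,0) and D = (<,=), matching Example 1 below.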

21 Direction of Dependence
There is a flow dependence if the iteration vector of the write is lexicographically less than the iteration vector of the read
In other words, if the direction vector is lexicographically positive:
D(i, j) = (=, =, ..., =, <, ...)
that is, all leading components are =, the first non-= component is <, and the components after it may be <, =, or >

22 Representing Dependences with Data Dependence Graphs
DO I = 1, ...
S1  A(I) = B(I) * 5
S2  C(I+1) = C(I) + A(I)
Static data dependences for the accesses to A and C: S1 δ(=) S2 and S2 δ(<) S2
S1 δ(=) S2 means S1[i] δ S2[i] for all i
S2 δ(<) S2 means S2[i] δ S2[j] for all i, j with i < j
It is generally infeasible to represent all data dependences that arise in a program
Usually only static data dependences are recorded
A data dependence graph compactly represents the data dependences in a loop nest
[Figure: data dependence graph with edge S1 -(=)-> S2 and self-edge S2 -(<)-> S2]

23 Example 1
DO I = 1, 3
  DO J = 1, I
S1    A(I+1,J) = A(I,J)
Flow dependence between S1 and itself on:
i = (1,1) and j = (2,1): d(i, j) = (1,0), D(i, j) = (<, =)
i = (2,1) and j = (3,1): d(i, j) = (1,0), D(i, j) = (<, =)
i = (2,2) and j = (3,2): d(i, j) = (1,0), D(i, j) = (<, =)
[Figure: iteration space in the (I,J) plane; dependence graph with self-edge S1 -(<,=)-> S1]

24 Example 2
DO I = 1, 4
  DO J = 1, 4
S1    A(I,J+1) = A(I-1,J)
Distance vector is (1,1): S1 δ(<,<) S1
[Figure: iteration space in the (I,J) plane; dependence graph with self-edge S1 -(<,<)-> S1]

25 Example 3
DO I = 1, 4
  DO J = 1, 5-I
S1    A(I+1,J+I-1) = A(I,J+I-1)
Distance vector is (1,-1): S1 δ(<,>) S1
[Figure: iteration space in the (I,J) plane; dependence graph with self-edge S1 -(<,>)-> S1]

26 Example 4
DO I = 1, 4
  DO J = 1, 4
S1    B(I) = B(I) + A(I,J)
S2    A(I+1,J) = A(I,J)
Distance vectors are (0,1) and (1,0):
S1 δ(=,<) S1
S2 δ(<,=) S1
S2 δ(<,=) S2
[Figure: iteration space in the (I,J) plane; dependence graph over S1 and S2]

27 Loop-Independent Dependences
Definition: Statement S1 has a loop-independent dependence on S2 iff there exist two iteration vectors i and j such that
1) S1 refers to memory location M on iteration i, S2 refers to M on iteration j, and i = j
2) there is a control flow path from S1 to S2 within the iteration
That is, the direction vector contains only = components

28 K-Level Loop-Carried Dependences
Definition: Statement S1 has a loop-carried dependence on S2 iff
1) there exist two iteration vectors i and j such that S1 refers to memory location M on iteration i and S2 refers to M on iteration j, and
2) the direction vector is lexicographically positive, that is, D(i, j) contains a < as its leftmost non-= component
A loop-carried dependence from S1 to S2 is
1) lexically forward if S2 appears after S1 in the loop body
2) lexically backward if S2 appears before S1 (or if S1 = S2)
The level of a loop-carried dependence is the index of the leftmost non-= component of D(i, j) for the dependence
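
A small sketch of how the level could be computed from a direction vector (my own illustration, not from the slides):

  ! Hypothetical sketch: the level of a loop-carried dependence, i.e. the
  ! index of the leftmost non-'=' component of direction vector DIRV;
  ! returns 0 if all components are '=' (a loop-independent dependence)
  INTEGER FUNCTION DEPLEVEL(DIRV, N)
    INTEGER, INTENT(IN)   :: N
    CHARACTER, INTENT(IN) :: DIRV(N)
    INTEGER :: K
    DEPLEVEL = 0
    DO K = 1, N
      IF (DIRV(K) /= '=') THEN
        DEPLEVEL = K
        RETURN
      END IF
    END DO
  END FUNCTION DEPLEVEL

For D = (=,=,<) this returns 3, matching Example 2 below.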

29 Example 1
DO I = 1, 3
S1  A(I+1) = F(I)
S2  F(I+1) = A(I)
S1 δ(<) S2 and S2 δ(<) S1
The loop-carried dependence S1 δ(<) S2 is lexically forward
The loop-carried dependence S2 δ(<) S1 is lexically backward
All loop-carried dependences are of level 1, because D(i, j) = (<) for every dependence

30 Example 2
DO I = 1, 10
  DO J = 1, 10
    DO K = 1, 10
S1      A(I,J,K+1) = A(I,J,K)
All loop-carried dependences are of level 3: D(i, j) = (=, =, <)
Level-k dependences are sometimes denoted by Sx δk Sy, so the dependence S1 δ(=,=,<) S1 can also be written as the level-3 dependence S1 δ3 S1

31 Combining Direction Vectors
A loop nest can have multiple different directions at the same loop level k
We abbreviate such a level-k direction vector component with * to denote any direction

DO I = 1, 9
S1  A(I) = ...
S2  ... = A(10-I)
S1 δ(*) S2, combining S1 δ(<) S2, S1 δ(=) S2, and S1 δ(>) S2

DO I = 1, 10
S1  S = S + A(I)
S1 δ(*) S1

DO I = 1, 10
  DO J = 1, 10
S1    A(J) = A(J)+1
S1 δ(*,=) S1

32 How to Determine Dependences?
A system of Diophantine dependence equations is set up and solved to test the direction of dependence (flow/anti)
Proving that there is (no) solution to the equations means there is (no) dependence
Most dependence solvers solve Diophantine equations over affine array index expressions of the form
a1 i1 + a2 i2 + ... + an in + a0
where the ij are loop index variables and the ak are integer constants
Non-linear subscripts are problematic:
Symbolic terms
Function values
Indirect indexing
etc.

33 Dependence Equation: Testing Flow Dependence
DO I = 1, N
S1  A(f(I)) = A(g(I))     (write on the left, read on the right)
To determine flow dependence: prove that the dependence equation f(α) = g(β) has a solution for iteration vectors α and β such that α < β

DO I = 1, N
S1  A(I+1) = A(I)
α+1 = β has solution α = β-1

DO I = 1, N
S1  A(2*I+1) = A(2*I)
2α+1 = 2β has no solution

34 Dependence Equation: Testing Anti Dependence
DO I = 1, N
S1  A(f(I)) = A(g(I))     (write on the left, read on the right)
To determine anti dependence: prove that the dependence equation f(α) = g(β) has a solution for iteration vectors α and β such that β < α

DO I = 1, N
S1  A(I+1) = A(I)
α+1 = β has no solution with β < α

DO I = 1, N
S1  A(2*I) = A(2*I+2)
2α = 2β+2 has solution α = β+1

35 Dependence Equation Examples
To determine flow dependence: prove there are iteration vectors α < β such that f(α) = g(β)

DO I = 1, N
  DO J = 1, N
S1    A(I-1) = A(J)
With α = (αI, αJ) and β = (βI, βJ), the equation αI - 1 = βJ has a solution

DO I = 1, N
S1  A(5) = A(I)
5 = β is a solution if 5 ≤ N

DO I = 1, N
S1  A(5) = A(6)
5 = 6 has no solution

36 ZIV, SIV, and MIV
DO I = ...
  DO J = ...
    DO K = ...
S1      A(5,I+1,J) = A(N,I,K)
The first subscript pair (5, N) is a ZIV pair, the second (I+1, I) an SIV pair, and the third (J, K) an MIV pair
ZIV (zero index variable) subscript pairs contain no loop index variables
SIV (single index variable) subscript pairs contain a single loop index variable
MIV (multiple index variable) subscript pairs contain multiple loop index variables

37 Example
DO I = 1, 4
  DO J = 1, 5-I
S1    A(I+1,J+I-1) = A(I,J+I-1)
The first subscript pair (I+1, I) is an SIV pair; the second (J+I-1, J+I-1) is an MIV pair
Distance vector is (1,-1): S1 δ(<,>) S1
The first distance vector component is easy to determine (SIV), but the second is not so easy (MIV)

38 Separability and Coupled Subscripts
DO I = ...
  DO J = ...
    DO K = ...
S1      A(I,J,J) = A(I,J,K)
The SIV pair is separable; the MIV pair gives coupled indices
When testing multidimensional arrays, subscripts are separable if their indices do not occur in other subscripts
ZIV and SIV subscript pairs are separable
MIV pairs may give rise to coupled subscript groups

39 Dependence Testing Overview
1. Partition the subscripts into separable and coupled groups
2. Classify each subscript as ZIV, SIV, or MIV
3. For each separable subscript, apply the applicable single-subscript test (ZIV, SIV, or MIV) to prove independence or produce direction vectors for the dependence
4. For each coupled group, apply a multiple-subscript test
5. If any test yields independence, no further testing is needed
6. Otherwise, merge all dependence vectors into one set

40 Loop Normalization
DO I = 2, 100
  DO J = 1, I-1
S1    A(I,J) = A(J,I)
normalizes to
DO i = 0, 98
  DO j = 0, i+2-2
S1    A(i+2,j+1) = A(j+1,i+2)
Some tests use normalized loop iteration spaces
Set the lower bound to zero (or one) by adjusting the limits and the occurrences of the loop variable
For an arbitrary loop with loop index I, loop bounds L and U, and stride S, the normalized iteration number is i = (I-L)/S
A normalized dependence system consists of a dependence equation along with a set of normalized constraints
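
A quick worked instance of the formula: the outer loop above has L = 2 and S = 1, so iteration I maps to i = (I-2)/1 = I-2, giving the normalized range i = 0, ..., 98 shown in the rewritten nest. For a strided loop such as DO I = 5, 55, 10 (my own example, not from the slides), I = 5, 15, ..., 55 maps to i = (I-5)/10 = 0, 1, ..., 5.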

41 Dependence System: Formulating Flow Dependence
Find α < β such that f(α) = g(β), where α = (αI, αJ) and β = (βI, βJ)
DO i = 0, 98
  DO j = 0, i+2-2
S1    A(i+2,j+1) = A(j+1,i+2)
Dependence equations:
αI + 2 = βJ + 1
αJ + 1 = βI + 2
Loop constraints (assuming normalized loop iteration spaces):
0 ≤ αI, βI ≤ 98
0 ≤ αJ ≤ αI
0 ≤ βJ ≤ βI
Constraint for the (<,*) dependence direction:
αI < βI
A dependence system consists of a dependence equation along with a set of constraints:
the solution must lie within the loop bounds,
the solution must be integer, and
we need a dependence distance or direction vector (flow/anti)

42 Dependence System: Matrix Notation
The polyhedral model: solving a linear (affine) dependence system in matrix notation, with the unknowns collected in x = (αI, βI, αJ, βJ)
Polyhedron { x : Ax ≤ c } for some matrix A and bounds vector c
A contains the loop constraints (inequalities)
Rewrite the dependence equation Ax = b as a set of inequalities: Ax ≤ b and -Ax ≤ -b
If no point lies in the polyhedron (a polytope) then there is no solution and thus there is no dependence

43 Fourier-Motzkin Projection
[Figure: a system of linear inequalities Ax ≤ b drawn in the (x1, x2) plane, together with its projections onto the x1 and x2 axes]

44 Fourier-Motzkin Variable Elimination (FMVE)
FMVE procedure to eliminate an unknown from a system Ax ≤ b:
1. Select an unknown xj
2. L = { i : aij < 0 }   (rows giving lower bounds on xj)
3. U = { i : aij > 0 }   (rows giving upper bounds on xj)
4. If L = ∅ or U = ∅ then xj is unconstrained: delete it
5. For i ∈ L ∪ U: A[i] := A[i] / |aij|, bi := bi / |aij|
6. For each i ∈ L and each k ∈ U, add the new inequality A[i] + A[k] ≤ bi + bk
7. Delete the old rows in L and U
Example: selecting x2 with L = {3,4} and U = {1,2} yields a new system over x1 alone, here max(1/2,-2) ≤ x1 ≤ min(11,7)
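
A tiny worked instance (my own, not the slide's): to eliminate x2 from { x1 - x2 ≤ 0, x2 ≤ 5 }, the first inequality is a lower bound on x2 (set L, coefficient -1) and the second an upper bound (set U, coefficient +1); both coefficients are already ±1, and adding the two rows cancels x2, leaving the projected system x1 ≤ 5.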

45 ZIV Test
The ZIV test is a simple and quick test that compares two subscripts containing no loop index variables
If the expressions are proved unequal, no dependence can exist

DO I = 1, N
S1  A(5) = A(6)

K = 10
DO I = 1, N
S1  A(5) = A(K)

K = 10
DO I = 1, N
  DO J = 1, N
S1    A(I,5) = A(I,K)

46 Strong SIV Test
Requires subscript pairs of the form aI+c1 and aI'+c2 for the def at iteration I and the use at iteration I', respectively
The dependence equation aI+c1 = aI'+c2 has a solution if the dependence distance d = (c1 - c2)/a is integer and |d| ≤ U - L, with L and U the loop bounds

DO I = 1, N
S1  A(I+1) = A(I)

DO I = 1, N
S1  A(2*I+2) = A(2*I)

DO I = 1, N
S1  A(I+N) = A(I)
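
A minimal Fortran sketch of this test (my own illustration, assuming a > 0):

  ! Hypothetical sketch of the strong SIV test for subscripts a*I+c1
  ! (write) and a*I+c2 (read) in a loop DO I = L, U: a dependence exists
  ! iff the distance d = (c1-c2)/a is an integer with |d| <= U-L
  LOGICAL FUNCTION STRONGSIV(A, C1, C2, L, U, D)
    INTEGER, INTENT(IN)  :: A, C1, C2, L, U   ! assumes A > 0
    INTEGER, INTENT(OUT) :: D
    STRONGSIV = .FALSE.
    D = 0
    IF (MOD(C1 - C2, A) /= 0) RETURN   ! distance not integer: independent
    D = (C1 - C2) / A
    STRONGSIV = (ABS(D) <= U - L)      ! distance must fit the iteration space
  END FUNCTION STRONGSIV

For the first loop above (a = 1, c1 = 1, c2 = 0) it reports a dependence at distance d = 1.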

47 Weak-Zero SIV Test
Requires subscript pairs of the form a1 I + c1 and a2 I' + c2 with either a1 = 0 or a2 = 0
If a2 = 0, the dependence equation I = (c2 - c1)/a1 has a solution if (c2 - c1)/a1 is integer and L ≤ (c2 - c1)/a1 ≤ U
If a1 = 0, the case is similar

DO I = 1, N
S1  A(I) = A(1)

DO I = 1, N
S1  A(2*I+2) = A(2)

DO I = 1, N
S1  A(1) = A(I) + A(1)
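
The corresponding sketch for the a2 = 0 case (my own illustration, assuming a1 is nonzero):

  ! Hypothetical sketch of the weak-zero SIV test for a1*I+c1 vs. the
  ! constant c2 in a loop DO I = L, U: a dependence exists iff
  ! (c2-c1)/a1 is an integer lying within the loop bounds
  LOGICAL FUNCTION WEAKZEROSIV(A1, C1, C2, L, U)
    INTEGER, INTENT(IN) :: A1, C1, C2, L, U   ! assumes A1 /= 0
    INTEGER :: I
    WEAKZEROSIV = .FALSE.
    IF (MOD(C2 - C1, A1) /= 0) RETURN
    I = (C2 - C1) / A1
    WEAKZEROSIV = (I >= L .AND. I <= U)
  END FUNCTION WEAKZEROSIV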

48 GCD Test
Requires that f(α) and g(β) are affine:
f(α) = a0 + a1 α1 + ... + an αn
g(β) = b0 + b1 β1 + ... + bn βn
Reordering gives a linear Diophantine equation
a1 α1 - b1 β1 + ... + an αn - bn βn = b0 - a0
which has a solution iff gcd(a1, ..., an, b1, ..., bn) divides b0 - a0
Which of these loops have dependences?

DO I = 1, N
S1  A(2*I+1) = A(2*I)

DO I = 1, N
S1  A(4*I+1) = A(2*I+3)

DO I = 1, 10
  DO J = 1, 10
S1    A(1+2*I+20*J) = A(2+20*I+2*J)
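
A minimal Fortran sketch of the GCD test (my own illustration, assuming at least one nonzero coefficient; loop bounds are deliberately ignored, which is why the test is conservative):

  ! Hypothetical sketch: the Diophantine equation
  ! a1*x1 + ... + am*xm = c has an integer solution iff
  ! gcd(a1,...,am) divides c
  LOGICAL FUNCTION GCDTEST(COEF, M, C)
    INTEGER, INTENT(IN) :: M, COEF(M), C
    INTEGER :: G, K, T, R
    G = 0
    DO K = 1, M                      ! fold gcd over all coefficients
      T = ABS(COEF(K))
      DO WHILE (T /= 0)              ! Euclid's algorithm: G := gcd(G, T)
        R = MOD(G, T)
        G = T
        T = R
      END DO
    END DO
    GCDTEST = (G /= 0 .AND. MOD(C, G) == 0)
  END FUNCTION GCDTEST

For the first loop, the equation is 2α - 2β = -1; gcd(2,2) = 2 does not divide -1, so the test reports independence, consistent with slide 33.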

49 Banerjee Test (1)
IF (M > 0) THEN
  DO I = 1, 10
S1    A(I) = A(I+M) + B(I)
The dependence equation α = β + M is rewritten as -M + α - β = 0
Constraints: 1 ≤ M, 0 ≤ α ≤ 9, 0 ≤ β ≤ 9
f(α) and g(β) are affine:
f(x1, x2, ..., xn) = a0 + a1 x1 + ... + an xn
g(y1, y2, ..., yn) = b0 + b1 y1 + ... + bn yn
The dependence equation a0 - b0 + a1 x1 - b1 y1 + ... + an xn - bn yn = 0 has no solution if 0 does not lie between the lower bound L and upper bound U of the equation's LHS

50 Banerjee Test (2)
IF (M > 0) THEN
  DO I = 1, 10
S1    A(I) = A(I+M) + B(I)
-M + α - β = 0, with constraints 1 ≤ M, 0 ≤ α ≤ 9, 0 ≤ β ≤ 9
Test for (*) dependence:
step            L             U
original eq.    -M + α - β    -M + α - β
eliminate β     -M + α - 9    -M + α - 0
eliminate α     -M - 9        -M + 9
eliminate M     -∞            8
Possible dependence, because 0 lies within [-∞, 8]

51 Banerjee Test (3)
-M + α - β = 0
Test for (<) dependence: assume α ≤ β - 1
Constraints: 1 ≤ M, 0 ≤ α ≤ β-1 ≤ 8, 1 ≤ α+1 ≤ β ≤ 9
step            L             U
original eq.    -M + α - β    -M + α - β
eliminate β     -M + α - 9    -M + α - (α+1)
eliminate α     -M - 9        -M - 1
eliminate M     -∞            -2
No dependence, because 0 does not lie within [-∞, -2]

52 Banerjee Test (4)
-M + α - β = 0
Test for (>) dependence: assume β ≤ α - 1
Constraints: 1 ≤ M, 1 ≤ β+1 ≤ α ≤ 9, 0 ≤ β ≤ α-1 ≤ 8
step            L                 U
original eq.    -M + α - β        -M + α - β
eliminate β     -M + α - (α-1)    -M + α - 0
eliminate α     -M + 1            -M + 9
eliminate M     -∞                8
Possible dependence, because 0 lies within [-∞, 8]

53 Banerjee Test (5)
-M + α - β = 0
Test for (=) dependence: assume α = β
Constraints: 1 ≤ M, 0 ≤ α = β ≤ 9
step            L             U
original eq.    -M + α - β    -M + α - β
eliminate β     -M + α - α    -M + α - α
eliminate α     -M            -M
eliminate M     -∞            -1
No dependence, because 0 does not lie within [-∞, -1]
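
A minimal Fortran sketch of the underlying bound computation (my own illustration; it assumes finite lower/upper bounds on every unknown, whereas the slides let M grow without bound, corresponding to an infinite limit):

  ! Hypothetical sketch of the Banerjee bounds: for an equation
  ! a1*x1 + ... + am*xm = c with box constraints LO(k) <= xk <= HI(k),
  ! a real-valued solution exists iff LB <= c <= UB
  SUBROUTINE BANERJEE(COEF, LO, HI, M, LB, UB)
    INTEGER, INTENT(IN)  :: M, COEF(M), LO(M), HI(M)
    INTEGER, INTENT(OUT) :: LB, UB
    INTEGER :: K
    LB = 0
    UB = 0
    DO K = 1, M
      IF (COEF(K) >= 0) THEN         ! positive coefficient: min at LO, max at HI
        LB = LB + COEF(K)*LO(K)
        UB = UB + COEF(K)*HI(K)
      ELSE                           ! negative coefficient: min at HI, max at LO
        LB = LB + COEF(K)*HI(K)
        UB = UB + COEF(K)*LO(K)
      END IF
    END DO
  END SUBROUTINE BANERJEE

With the (=) constraint α = β folded in, the example's LHS reduces to -M; for any finite range M in [1, Mmax] the sketch returns [-Mmax, -1], which never contains 0, matching the slide.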

54 Testing for All Direction Vectors
Refine the dependence between a pair of statements by testing for the most general set of direction vectors and, upon failure to prove independence, refining each component recursively:
(*,*,*)  yes (may depend)
  (<,*,*)  yes     (=,*,*)     (>,*,*)
    (<,<,*)     (<,=,*)  yes     (<,>,*)  no (indep)
      (<,=,<)     (<,=,=)     (<,=,>)

55 Direction Vectors and Reordering Transformations (1)
Definitions:
A reordering transformation is any program transformation that changes the execution order of the code, without adding or deleting any statement executions
A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence
Theorem: A reordering transformation is valid if, after it is applied, the new dependence vectors are lexicographically positive, i.e. none of the new direction vectors has a leftmost non-= component that is >

56 Direction Vectors and Reordering Transformations (2)
A reordering transformation preserves all level-k dependences if
1) it preserves the iteration order of the level-k loop,
2) it does not interchange any loop at level < k to a position inside the level-k loop, and
3) it does not interchange any loop at level > k to a position outside the level-k loop

57 Examples
DO I = 1, 10
S1  F(I+1) = A(I)
S2  A(I+1) = F(I)
Both S1 δ(<) S2 and S2 δ(<) S1 are level-1 dependences, so the statements may be swapped:
DO I = 1, 10
S2  A(I+1) = F(I)
S1  F(I+1) = A(I)

DO I = 1, 10
  DO J = 1, 10
    DO K = 1, 10
S1      A(I+1,J+2,K+3) = A(I,J,K) + B
S1 δ(<,<,<) S1 is a level-1 dependence, so the inner loops may be interchanged and even reversed:
DO I = 1, 10
  DO K = 10, 1, -1
    DO J = 1, 10
S1      A(I+1,J+2,K+3) = A(I,J,K) + B

58 Direction Vectors and Reordering Transformations (3)
A reordering transformation preserves a loop-independent dependence between statements S1 and S2 if it does not move statement instances between iterations and preserves the relative order of S1 and S2

59 Examples
DO I = 1, 10
S1  A(I) = ...
S2  ... = A(I)
S1 δ(=) S2 is a loop-independent dependence; the loop may be reversed as long as S1 and S2 stay in order:
DO I = 10, 1, -1
S1  A(I) = ...
S2  ... = A(I)

DO I = 1, 10
  DO J = 1, 10
S1    A(I,J) = ...
S2    ... = A(I,J)
S1 δ(=,=) S2 is a loop-independent dependence; both loops may be reversed:
DO J = 10, 1, -1
  DO I = 10, 1, -1
S1    A(I,J) = ...
S2    ... = A(I,J)

60 Loop Fission
Loop fission (or loop distribution) splits a single loop into multiple loops
It enables vectorization, and enables parallelization of separate loops when the original loop is sequential
Loop fission must preserve all dependence relations of the original loop

S1  DO I = 1, 10
S2    DO J = 1, 10
S3      A(I,J) = B(I,J) + C(I,J)
S4      D(I,J) = A(I,J-1) * 2.0
with S3 δ(=,<) S4, fission of the J loop gives
S1  DO I = 1, 10
S2    DO J = 1, 10
S3      A(I,J) = B(I,J) + C(I,J)
Sx    DO J = 1, 10
S4      D(I,J) = A(I,J-1) * 2.0
which still has S3 δ(=,<) S4, and vectorizes/parallelizes to
S1  PARALLEL DO I = 1, 10
S3    A(I,1:10) = B(I,1:10) + C(I,1:10)
S4    D(I,1:10) = A(I,0:9) * 2.0

61 Loop Fission: Algorithm
S1  DO I = 1, 10
S2    A(I) = A(I) + B(I-1)
S3    B(I) = C(I-1)*X + Z
S4    C(I) = 1/B(I)
S5    D(I) = SQRT(C(I))
S6  ENDDO
Dependences: S3 δ(<) S2, S4 δ(<) S3, S3 δ(=) S4, S4 δ(=) S5
Compute the acyclic condensation of the dependence graph to find a legal order of the loops: S3 and S4 form a cycle and stay together; the condensation orders {S3, S4} before S2 and before S5
[Figure: dependence graph over S2..S5 and its acyclic condensation]
S1  DO I = 1, 10
S3    B(I) = C(I-1)*X + Z
S4    C(I) = 1/B(I)
Sx  ENDDO
Sy  DO I = 1, 10
S2    A(I) = A(I) + B(I-1)
Sz  ENDDO
Su  DO I = 1, 10
S5    D(I) = SQRT(C(I))
Sv  ENDDO

62 Loop Fusion
Loop fusion combines two consecutive loops with the same IV and loop bounds into one
The fused loop must preserve all dependence relations of the original loops
Fusion enables more effective scalar optimizations in the fused loop, but may reduce temporal locality

S1  B(1) = T(1)*X(1)
S2  DO I = 2, N
S3    B(I) = T(I)*X(I)
S4  ENDDO
S5  DO I = 2, N
S6    A(I) = B(I) - B(I-1)
S7  ENDDO
The original code has dependences S1 δ S6 and S3 δ S6; fusion gives
S1  B(1) = T(1)*X(1)
Sx  DO I = 2, N
S3    B(I) = T(I)*X(I)
S6    A(I) = B(I) - B(I-1)
Sy  ENDDO
with dependences S1 δ S6, S3 δ(=) S6, and S3 δ(<) S6

63 Example
S1  DO I = 1, N
S2    A(I) = B(I) + 1
S3  ENDDO
S4  DO I = 1, N
S5    C(I) = A(I)/2
S6  ENDDO
S7  DO I = 1, N
S8    D(I) = 1/C(I+1)
S9  ENDDO
Which of the three fused loops is legal?
a)
S1  DO I = 1, N
S2    A(I) = B(I) + 1
S3  ENDDO
Sx  DO I = 1, N
S5    C(I) = A(I)/2
S8    D(I) = 1/C(I+1)
Sy  ENDDO
b)
Sx  DO I = 1, N
S2    A(I) = B(I) + 1
S5    C(I) = A(I)/2
Sy  ENDDO
S7  DO I = 1, N
S8    D(I) = 1/C(I+1)
S9  ENDDO
c)
Sx  DO I = 1, N
S2    A(I) = B(I) + 1
S5    C(I) = A(I)/2
S8    D(I) = 1/C(I+1)
Sy  ENDDO

64 Alignment for Fusion
Alignment changes the iteration bounds of one loop to enable fusion when dependences would otherwise prevent it

S1  DO I = 1, N
S2    B(I) = T(I)/C
S3  ENDDO
S4  DO I = 1, N
S5    A(I) = B(I+1) - B(I-1)
S6  ENDDO
with S2 δ S5; align the first loop:
S1  DO I = 0, N-1
S2    B(I+1) = T(I+1)/C
S3  ENDDO
S4  DO I = 1, N
S5    A(I) = B(I+1) - B(I-1)
S6  ENDDO
then peel and fuse:
Sx  B(1) = T(1)/C
S1  DO I = 1, N-1
S2    B(I+1) = T(I+1)/C
S5    A(I) = B(I+1) - B(I-1)
S6  ENDDO
Sy  A(N) = B(N+1) - B(N-1)
Loop dependences: S2 δ(=) S5 and S2 δ(<) S5

65 Reversal for Fusion
Loop reversal reverses the direction of the iteration; it is only legal for loops that carry no dependence
Reversal can enable loop fusion by ensuring the dependences between the loop statements are preserved

S1  DO I = 1, N
S2    B(I) = T(I)*X(I)
S3  ENDDO
S4  DO I = 1, N
S5    A(I) = B(I+1)
S6  ENDDO
with S2 δ S5; reverse both loops:
S1  DO I = N, 1, -1
S2    B(I) = T(I)*X(I)
S3  ENDDO
S4  DO I = N, 1, -1
S5    A(I) = B(I+1)
S6  ENDDO
then fuse:
S1  DO I = N, 1, -1
S2    B(I) = T(I)*X(I)
S5    A(I) = B(I+1)
S6  ENDDO
with S2 δ(<) S5

66 Loop Interchange
S1  DO I = 1, N
S2    DO J = 1, M
S3      DO K = 1, L
S4        A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
S5      ENDDO
S6    ENDDO
S7  ENDDO
S4 δ(<,<,=) S4 and S4 δ(<,=,>) S4
Compute the direction matrix and find which columns can be permuted without violating the dependence relations of the original loop nest
Direction matrix (one row per dependence, one column per loop):
< < =
< = >
Interchanging the I and J columns gives the rows (<,<,=) and (=,<,>), which are still lexicographically positive: valid
Permuting the columns into (J,K,I) order gives the rows (<,=,<) and (=,>,<), where the second row leads with > after =: invalid

67 Loop Parallelization
It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence
S1  DO I = 1, 4
      DO J = 1, 4
        A(I,J+1) = A(I,J)
S1 δ(=,<) S1: the dependence is carried by the J loop, so the I loop can be parallelized:
S1  PARALLEL DO I = 1, 4
      DO J = 1, 4
        A(I,J+1) = A(I,J)

68 Vectorization (1)
A single-statement loop that carries no dependence can be vectorized:
DO I = 1, 4
S1  X(I) = X(I) + C
vectorizes to the Fortran 90 array statement
S1  X(1:4) = X(1:4) + C
The vector operation adds (C,C,C,C) to X(1:4) elementwise and assigns the result to X(1:4)

69 Vectorization (2)
DO I = 1, N
S1  X(I+1) = X(I) + C
This loop has a loop-carried dependence S1 δ(<) S1, so vectorizing it is invalid:
S1  X(2:N+1) = X(1:N) + C
is not a valid vectorization of the loop, because the array statement uses only the old values of X

70 Vectorization (3)
Because only single loop statements can be vectorized, loops with multiple statements must first be transformed using the loop distribution transformation
This works when the loop has no loop-carried dependence or has only lexically forward flow dependences
DO I = 1, N
S1  A(I+1) = B(I) + C
S2  D(I) = A(I) + E
distributes to
DO I = 1, N
S1  A(I+1) = B(I) + C
DO I = 1, N
S2  D(I) = A(I) + E
and vectorizes to
S1  A(2:N+1) = B(1:N) + C
S2  D(1:N) = A(1:N) + E

71 Vectorization (4)
When a loop has backward flow dependences and no loop-independent dependences, interchange the statements to enable loop distribution
DO I = 1, N
S1  D(I) = A(I) + E
S2  A(I+1) = B(I) + C
interchange and distribute:
DO I = 1, N
S2  A(I+1) = B(I) + C
DO I = 1, N
S1  D(I) = A(I) + E
then vectorize:
S2  A(2:N+1) = B(1:N) + C
S1  D(1:N) = A(1:N) + E

72 Preliminary Transformations to Support Dependence Testing
Dependence tests that determine distance and direction vectors require array subscripts in a standard form
Preliminary transformations put more subscripts into affine form, where subscripts are linear integer functions of the loop induction variables:
Loop normalization
Forward substitution
Constant propagation
Induction-variable substitution
Scalar expansion

73 Forward Substitution and Constant Propagation
N = 2
DO I = 1, 100
  K = I + N
S1  A(K) = A(K) + 5
becomes
DO I = 1, 100
S1  A(I+2) = A(I+2) + 5
Forward substitution and constant propagation help put array subscripts into a form suitable for dependence analysis

74 Induction Variable Substitution
Example loop with IVs:
I = 0
J = 1
while (I < N)
  I = I + 1
  ... = A[J]
  J = J + 2
  K = 2*I
  A[K] = ...
endwhile
After IV substitution (IVS); note the affine indexes:
for i = 0 to N-1
S1:  ... = A[2*i+1]
S2:  A[2*i+2] = ...
endfor
Dependence test: the GCD test solves the dependence equation 2*id - 2*iu = -1; since 2 does not divide 1, there is no data dependence
After parallelization:
forall (i = 0, N-1)
  ... = A[2*i+1]
  A[2*i+2] = ...
endforall

75 Example (1)
Before and after induction-variable substitution of KI in the inner loop:
INC = 2
KI = 0
DO I = 1, 100
  DO J = 1, 100
    KI = KI + INC
    U(KI) = U(KI) + W(J)
  S(I) = U(KI)
becomes
INC = 2
KI = 0
DO I = 1, 100
  DO J = 1, 100
    U(KI+J*INC) = U(KI+J*INC) + W(J)
  KI = KI + 100*INC
  S(I) = U(KI)

76 Example (2)
Before and after induction-variable substitution of KI in the outer loop:
INC = 2
KI = 0
DO I = 1, 100
  DO J = 1, 100
    U(KI+J*INC) = U(KI+J*INC) + W(J)
  KI = KI + 100*INC
  S(I) = U(KI)
becomes
INC = 2
KI = 0
DO I = 1, 100
  DO J = 1, 100
    U(KI+(I-1)*100*INC+J*INC) = U(KI+(I-1)*100*INC+J*INC) + W(J)
  S(I) = U(KI+I*100*INC)
KI = KI + 100*100*INC

77 Example (3)
Before and after constant propagation of KI and INC:
INC = 2
KI = 0
DO I = 1, 100
  DO J = 1, 100
    U(KI+(I-1)*100*INC+J*INC) = U(KI+(I-1)*100*INC+J*INC) + W(J)
  S(I) = U(KI+I*100*INC)
KI = KI + 100*100*INC
becomes
DO I = 1, 100
  DO J = 1, 100
    U(I*200+J*2-200) = U(I*200+J*2-200) + W(J)
  S(I) = U(I*200)
KI = 20000
The subscripts are now affine

78 Scalar Expansion
Scalar expansion breaks anti-dependence relations by expanding or promoting a scalar into an array
Scalar anti-dependence relations prevent certain loop transformations, such as loop fission and loop interchange

S1  DO I = 1, N
S2    T = A(I) + B(I)
S3    C(I) = T + 1/T
S4  ENDDO
with S2 δ(=) S3 and the anti dependence S3 δ-1(<) S2; expansion gives
Sx  IF (N > 0) THEN
Sy    ALLOC Tx(1:N)
S1    DO I = 1, N
S2      Tx(I) = A(I) + B(I)
S3      C(I) = Tx(I) + 1/Tx(I)
S4    ENDDO
Sz    T = Tx(N)
Su  ENDIF
leaving only S2 δ(=) S3

79 Example
S1  DO I = 1, 10
S2    T = A(I,1)
S3    DO J = 2, 10
S4      T = T + A(I,J)
S5    ENDDO
S6    B(I) = T
S7  ENDDO
Dependences: S2 δ(=) S4, S4 δ(=,<) S4, S4 δ(=) S6, and the anti dependence S6 δ-1(<) S2
Scalar expansion:
S1  DO I = 1, 10
S2    Tx(I) = A(I,1)
S3    DO J = 2, 10
S4      Tx(I) = Tx(I) + A(I,J)
S5    ENDDO
S6    B(I) = Tx(I)
S7  ENDDO
with S2 δ(=) S4, S4 δ(=,<) S4, S4 δ(=) S6
Loop fission:
S1  DO I = 1, 10
S2    Tx(I) = A(I,1)
Sx  ENDDO
S1  DO I = 1, 10
S3    DO J = 2, 10
S4      Tx(I) = Tx(I) + A(I,J)
S5    ENDDO
Sy  ENDDO
Sz  DO I = 1, 10
S6    B(I) = Tx(I)
S7  ENDDO
with S2 δ S4, S4 δ(=,<) S4, S4 δ S6
Vectorization:
S2  Tx(1:10) = A(1:10,1)
S3  DO J = 2, 10
S4    Tx(1:10) = Tx(1:10) + A(1:10,J)
S5  ENDDO
S6  B(1:10) = Tx(1:10)
with S2 δ S4, S4 δ(<,=) S4, S4 δ S6

80 Unimodular Loop Transformations
A unimodular matrix U is a matrix with integer entries and determinant ±1
Such a matrix maps an object onto another object with exactly the same number of integer points in it
The set of all unimodular matrices forms a group:
The identity matrix is the identity element
The inverse U⁻¹ of U exists and is also unimodular
Unimodular matrices are closed under multiplication: a composition of unimodular transformations is unimodular

81 Unimodular Loop Transformations (cont'd)
Many loop transformations can be represented as matrix operations using unimodular matrices, including loop interchange, loop reversal, and loop skewing
DO K = 1, 100
  DO L = 1, 50
    A(K,L) = ...
Interchange, U_interchange = [0 1; 1 0]:
DO L = 1, 50
  DO K = 1, 100
    A(K,L) = ...
Reversal, U_reversal = [-1 0; 0 1]:
DO K = -100, -1
  DO L = 1, 50
    A(-K,L) = ...
Skewing, U_skewing = [1 0; p 1]:
DO K = 1, 100
  DO L = 1+p*K, 50+p*K
    A(K,L-p*K) = ...

82 Unimodular Loop Transformations (cont'd)
The unimodular transformation is also applied to the dependence vectors: if the result is lexicographically positive, the transformation is legal
DO K = 2, N
  DO L = 1, N-1
    A(L,K) = A(L+1,K-1)
The loop has distance vector d(i,j) = (1,-1) for the writes at iterations i = (i1,i2) and the reads at iterations j = (j1,j2)
U_interchange d = [0 1; 1 0] (1,-1)ᵀ = (-1,1)ᵀ, which is lexicographically negative, so interchange is not legal here
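
A minimal Fortran sketch of this legality check (my own illustration, not from the slides), using the interchange matrix and distance vector of the example:

  ! Hypothetical sketch: apply a unimodular matrix U to a distance
  ! vector D and check that the result is lexicographically positive
  PROGRAM UNICHECK
    INTEGER :: U(2,2), D(2), DNEW(2), K
    LOGICAL :: LEGAL
    U = RESHAPE((/ 0, 1, 1, 0 /), (/ 2, 2 /))   ! loop interchange
    D = (/ 1, -1 /)
    DNEW = MATMUL(U, D)
    LEGAL = .FALSE.
    DO K = 1, 2                      ! leftmost nonzero component decides
      IF (DNEW(K) > 0) THEN
        LEGAL = .TRUE.
        EXIT
      ELSE IF (DNEW(K) < 0) THEN
        EXIT
      END IF
    END DO
    PRINT *, 'transformed distance vector:', DNEW, ' legal:', LEGAL
  END PROGRAM UNICHECK

Running this prints the transformed vector (-1, 1) and legal: F, confirming that interchange is illegal for this loop.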

83 Unimodular Loop Transformations (cont'd)
For a loop nest with iteration space IS = { I : A I ≤ b }, the transformed loop nest is given by A U⁻¹ I' ≤ b, where the new loop body has indices I' = U I
After transformation, IS' = { I' : A U⁻¹ I' ≤ b }
The transformed loop nest must be normalized by means of Fourier-Motzkin elimination to ensure that the loop bounds are affine expressions in the more outer loop indices
Example:
DO M = 1, 3
  DO N = M+1, 4
    A(M,N) = ...
with [K; L] = [0 1; 1 0] [M; N] (interchange), FMVE on the constraints 1 ≤ L ≤ 3, L ≤ K-1, 2 ≤ K ≤ 4 gives
DO K = 2, 4
  DO L = 1, MIN(3,K-1)
    A(L,K) = ...

84 Further Reading
[Wolfe] Chapters 5, 7-9


More information

Matrices. Chapter Definitions and Notations

Matrices. Chapter Definitions and Notations Chapter 3 Matrices 3. Definitions and Notations Matrices are yet another mathematical object. Learning about matrices means learning what they are, how they are represented, the types of operations which

More information

Design of Distributed Systems Melinda Tóth, Zoltán Horváth

Design of Distributed Systems Melinda Tóth, Zoltán Horváth Design of Distributed Systems Melinda Tóth, Zoltán Horváth Design of Distributed Systems Melinda Tóth, Zoltán Horváth Publication date 2014 Copyright 2014 Melinda Tóth, Zoltán Horváth Supported by TÁMOP-412A/1-11/1-2011-0052

More information

Section Summary. Sequences. Recurrence Relations. Summations Special Integer Sequences (optional)

Section Summary. Sequences. Recurrence Relations. Summations Special Integer Sequences (optional) Section 2.4 Section Summary Sequences. o Examples: Geometric Progression, Arithmetic Progression Recurrence Relations o Example: Fibonacci Sequence Summations Special Integer Sequences (optional) Sequences

More information

Hermite normal form: Computation and applications

Hermite normal form: Computation and applications Integer Points in Polyhedra Gennady Shmonin Hermite normal form: Computation and applications February 24, 2009 1 Uniqueness of Hermite normal form In the last lecture, we showed that if B is a rational

More information

DEFINITION OF COLLAPSIBLE GRAPHS

DEFINITION OF COLLAPSIBLE GRAPHS Appendix A DEFINITION OF COLLAPSIBLE GRAPHS In the following discussion, i, j, and k are nodes. The indegree of a node i, which will be called indeg (i), is the number of incoming arcs. The outdegree of

More information

Vector Lane Threading

Vector Lane Threading Vector Lane Threading S. Rivoire, R. Schultz, T. Okuda, C. Kozyrakis Computer Systems Laboratory Stanford University Motivation Vector processors excel at data-level parallelism (DLP) What happens to program

More information

Pipelined Computations

Pipelined Computations Chapter 5 Slide 155 Pipelined Computations Pipelined Computations Slide 156 Problem divided into a series of tasks that have to be completed one after the other (the basis of sequential programming). Each

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K R MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND First Printing, 99 Chapter LINEAR EQUATIONS Introduction to linear equations A linear equation in n unknowns x,

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 2: Direct Methods PD Dr.

More information

Lemma 8: Suppose the N by N matrix A has the following block upper triangular form:

Lemma 8: Suppose the N by N matrix A has the following block upper triangular form: 17 4 Determinants and the Inverse of a Square Matrix In this section, we are going to use our knowledge of determinants and their properties to derive an explicit formula for the inverse of a square matrix

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

Model Checking: An Introduction

Model Checking: An Introduction Model Checking: An Introduction Meeting 3, CSCI 5535, Spring 2013 Announcements Homework 0 ( Preliminaries ) out, due Friday Saturday This Week Dive into research motivating CSCI 5535 Next Week Begin foundations

More information

5.7 Cramer's Rule 1. Using Determinants to Solve Systems Assumes the system of two equations in two unknowns

5.7 Cramer's Rule 1. Using Determinants to Solve Systems Assumes the system of two equations in two unknowns 5.7 Cramer's Rule 1. Using Determinants to Solve Systems Assumes the system of two equations in two unknowns (1) possesses the solution and provided that.. The numerators and denominators are recognized

More information

The simplex algorithm

The simplex algorithm The simplex algorithm The simplex algorithm is the classical method for solving linear programs. Its running time is not polynomial in the worst case. It does yield insight into linear programs, however,

More information

Fundamental Algorithms

Fundamental Algorithms Chapter 2: Sorting, Winter 2018/19 1 Fundamental Algorithms Chapter 2: Sorting Jan Křetínský Winter 2018/19 Chapter 2: Sorting, Winter 2018/19 2 Part I Simple Sorts Chapter 2: Sorting, Winter 2018/19 3

More information

CS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/

CS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/ CS-206 Concurrency Lecture 13 Wrap Up Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/ Created by Nooshin Mirzadeh, Georgios Psaropoulos and Babak Falsafi EPFL Copyright 2015 EPFL CS-206 Spring

More information

Fundamental Algorithms

Fundamental Algorithms Fundamental Algorithms Chapter 2: Sorting Harald Räcke Winter 2015/16 Chapter 2: Sorting, Winter 2015/16 1 Part I Simple Sorts Chapter 2: Sorting, Winter 2015/16 2 The Sorting Problem Definition Sorting

More information

COMP 633: Parallel Computing Fall 2018 Written Assignment 1: Sample Solutions

COMP 633: Parallel Computing Fall 2018 Written Assignment 1: Sample Solutions COMP 633: Parallel Computing Fall 2018 Written Assignment 1: Sample Solutions September 12, 2018 I. The Work-Time W-T presentation of EREW sequence reduction Algorithm 2 in the PRAM handout has work complexity

More information

SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS SOLVING LINEAR SYSTEMS We want to solve the linear system a, x + + a,n x n = b a n, x + + a n,n x n = b n This will be done by the method used in beginning algebra, by successively eliminating unknowns

More information

Algorithms PART II: Partitioning and Divide & Conquer. HPC Fall 2007 Prof. Robert van Engelen

Algorithms PART II: Partitioning and Divide & Conquer. HPC Fall 2007 Prof. Robert van Engelen Algorithms PART II: Partitioning and Divide & Conquer HPC Fall 2007 Prof. Robert van Engelen Overview Partitioning strategies Divide and conquer strategies Further reading HPC Fall 2007 2 Partitioning

More information

Projection in Logic, CP, and Optimization

Projection in Logic, CP, and Optimization Projection in Logic, CP, and Optimization John Hooker Carnegie Mellon University Workshop on Logic and Search Melbourne, 2017 Projection as a Unifying Concept Projection is a fundamental concept in logic,

More information

Conventional Matrix Operations

Conventional Matrix Operations Overview: Graphs & Linear Algebra Peter M. Kogge Material based heavily on the Class Book Graph Theory with Applications by Deo and Graphs in the Language of Linear Algebra: Applications, Software, and

More information

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul

More information

Abstractions and Decision Procedures for Effective Software Model Checking

Abstractions and Decision Procedures for Effective Software Model Checking Abstractions and Decision Procedures for Effective Software Model Checking Prof. Natasha Sharygina The University of Lugano, Carnegie Mellon University Microsoft Summer School, Moscow, July 2011 Lecture

More information