Loop Scheduling and Software Pipelining \course\cpeg421-08s\topic-7.ppt 1

Size: px

Start display at page:

Download "Loop Scheduling and Software Pipelining \course\cpeg421-08s\topic-7.ppt 1"

Gillian Watts
5 years ago
Views:

1 Loop Scheduling and Software Pipelining \course\cpeg421-08s\topic-7.ppt 1

2 Reading List Slides: Topic 7 and 7a Other papers as assigned in class or homework: \course\cpeg421-08s\topic-7.ppt 2

3 ABET Outcome Ability to apply knowledge of basic code generation techniques, e.g. Loop scheduling e.g. software pipelining techniques to solve code generation problems. An ability to identify, formulate and solve loops scheduling problems using software pipelining techniques Ability to analyze the basic algorithms on the above techniques and conduct experiments to show their effectiveness. Ability to use a modern compiler development platform and tools for the practice of above. A Knowledge on contemporary issues on this topic \course\cpeg421-08s\topic-7.ppt 3

4 Outline Brief overview Problem formulation of the modulo scheduling problem Solution methods Summary \course\cpeg421-08s\topic-7.ppt 4

5 General Compiler Framework Source ME Inter-Procedural Optimization (IPA) Loop Nest Optimization (LNO) Global Optimization (OPT) Innermost Loop scheduling BE/CG Global inst scheduling Reg alloc Local inst scheduling Arch Models Good IPO Good LNO Good global optimization Good integration of IPO/LNO/OPT Smooth information passing between FE and CG Complete and flexible support of inner-loop scheduling (SWP), instruction scheduling and register allocation Executable \course\cpeg421-08s\topic-7.ppt 5

6 Questions? How to formulate the loop scheduling problem? How to model it? How to solve it? \course\cpeg421-08s\topic-7.ppt 6

7 Questions (cont d) Instruction scheduling for code without loops a review Dependence graphs may become cyclic! So, critical path length is less obvious! Is it becoming harder!(?) What new insights are required to formulate and solve it? \course\cpeg421-08s\topic-7.ppt 7

8 Challenges of Loop Scheduling A DDG With Cycles Right strategy? \course\cpeg421-08s\topic-7.ppt 8

9 Observations Execution of good loops tend to be regular and repetitive - a pattern may appear This gives cyclic scheduling problem a new twist! How to efficiently derive a pattern? \course\cpeg421-08s\topic-7.ppt 9

10 Problem Formulation (I) Given a weighted dependence graph, derive a schedule which is time-optimal under a machine model M. Def: A schedule S of a loop L is time-optimal if among all legal schedules of L, no other schedule that is faster than S. Note: There may be more than one time-optimal schedule \course\cpeg421-08s\topic-7.ppt 10

11 A Short Tour on Data Dependence Graphs for Loops \course\cpeg421-08s\topic-7.ppt 11

12 Basic Concept and Motivation Data dependence between 2 accesses The same memory location Exist an execution path between them One of them is a write Three types of data dependence Dependence graphs Things are not simple when dealing with loops \course\cpeg421-08s\topic-7.ppt 12

13 Types of Data Dependence Flow dependence X = = X... δ Anti-dependence X = = X δ -1 Output dependence X = X =... 0 δ \course\cpeg421-08s\topic-7.ppt 13

14 Data Dependence Example 1: S1: A = 0 S2: B = A S3: C = A + D S4: D = 2 S1 S2 S3 S4 S x δ S y S y depends on S x \course\cpeg421-08s\topic-7.ppt 14

15 Data Dependence Con d Example 2: S1: A = 0 S2: B = A S3: A = B + 1 S4: C = A S1 S2 S3 S4 S 1 S 2 δ δ 0 S 3 : Output-dep -1 S 3 : anti-dep \course\cpeg421-08s\topic-7.ppt 15

16 Should we consider input dependence? = X = X Is the reading of the same X important? Well, it may be! (if we intend to group the 2 reads together for cache optimization!) \course\cpeg421-08s\topic-7.ppt 16

17 Subscript Variables Extension of def-use chains to employ a more precise treatment of arrays, especially in iterative loops. DO I = 1, N A(I + 1) = X(I + 1) + B(I) X(I) = A(I) * 5 ENDDO \course\cpeg421-08s\topic-7.ppt 17

18 Dependence Graph Con d Applications - register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization \course\cpeg421-08s\topic-7.ppt 18

19 Data Dependence in Loops An Example Find the dependence relations due to the array X in the program below: (1) for I = 2 to 9 do (2) X[I] = Y[I] + Z[I] (3) A[I] = X[I-1] + 1 (4) end for Solution To find the data dependence relations in a simple loop, we can unroll the loop and see which statement instances depend on which others: I = 2 I = 3 I = 4 (2) X[2]=Y[2]+Z[2] (3) A[2]=X[1]+1 X[3] =Y[3]+Z[3] A[3] =X[2]+1 X[4]=Y[4]+Z[4] A[4]=X[3] \course\cpeg421-08s\topic-7.ppt 19

20 Data Dependence in Loops Con d In our example, there is a loop-carried, lexically forward flow dependence relation. S 2 S 3 (1) Dependence distance = 1 Data dependence graph for statements in a loop. - Loop-carried vs loop-independent - Lexical-forward vs lexical backward \course\cpeg421-08s\topic-7.ppt 20

21 An Example for i = 0 to N - 1 do a: a[i] = a[i - 1] + R[i]; b: b[i] = a[i] + c[i - 1]; c: c[i] = b[i] + 1; a b time i = 0 i = 1 i = 2 0 a end; 1 b Note: We use a token here to represent a flow dependence of distance = 1 II time 2 c a 3 b 4 c a So, iteration interval II = 2 iterations i = 3... c Assume each operation takes 1 cycle and there is only one addition unit! \course\cpeg421-08s\topic-7.ppt 21

22 Software Pipeline: Concept Software Pipeline is a technique that reduces the execution time of important loops by interweaving operations from many iterations to optimize the use of resources. stf fadds ldf cmp bg sub time \course\cpeg421-08s\topic-7.ppt 22

23 The Structrure of the SWP code % prologue a[0] = a[-1] + R[0]; % pattern for i = 0 to N-2 do b[i] = a[i] + c[i -1]; a[i + 1] = a[i] + R[i + 1]; c[i] = b[i] + 1; end; % epilogue b[n - 1] = a[n - 1] + c[n - 2]; c[n - 1] = b[n - 1] + 1 prologue b i, c i, a i+1 epilogue \course\cpeg421-08s\topic-7.ppt 23

24 Software Pipeline (Cont d) What limits the speed of a loop? Data dependencies: recurrence initiation interval (rec_mii) Processor resources: resource initiation interval (res_mii) Initiation interval stf fadds ldf cmp bg sub time \course\cpeg421-08s\topic-7.ppt 24

25 Previous Approaches Approach I (Operational): Emulate the loop execution under the machine model and a pattern will eventually occur [AikenNic88, EbciogluNic89, GaoEtAl91] Approach II (Periodic scheduling): Specify the scheduling problem into a periodical scheduling problem and find optimal solution [Lam88, RauEtAl81,GovindAltmanGao94] \course\cpeg421-08s\topic-7.ppt 25

26 Periodic Schedule (Modulo Scheduling) The time (cycle) when the I-th instance of the operation v is scheduled : t(i, v) = T * i+ A[v] where T = II so t(i + 1, v) -t(i, v) = T*(i + 1) T*(i) = T For our example: t(i, v) = 2i + A[v] where A(a) = 1 A(b) = 0 A(c) = 1 Question: Is this an optimal schedule? \course\cpeg421-08s\topic-7.ppt 26

27 Periodic Schedule (cont d) Yes, the schedule t(i, v) = 2i + A[v] is time-optimal! With II = \course\cpeg421-08s\topic-7.ppt 27

28 Restate the problem Given a DDG of a loop L, how to determine the fastest computation rate of L -- also called minimum initiation interval (MII)? Hint: Consider the Critical Cycles as well as critical resource usage in L \course\cpeg421-08s\topic-7.ppt 28

29 Recurrence MII -- RecMII An Example \course\cpeg421-08s\topic-7.ppt 29

30 An Example (Revisit) : RecMII for i = 0 to N - 1 do a: a[i] = a[i - 1] + R[i]; b: b[i] = a[i] + c[i - 1]; c: c[i] = b[i] + 1; end; a b II time time i = 0 i = 1 i = 2 0 a 1 b 2 c a 3 b 4 c a So, RecMII = 2 iterations i = 3... c Assume each operation takes 1 cycle and there is only one addition unit! \course\cpeg421-08s\topic-7.ppt 30

31 Hint: now one must think about deadlines. Maximum Computation Rate Theorem: The maximum computation rate of a loop is bounded by the following ratio r opt (C) = min{ Dc/Wc} where C is a dependence cycle, Dc is the total dependence distance along C and Wc is total execution time of C. i.e. Dc Wc = c = c di wi (d i is the dependence distance along the edge i in C) (w i is the edge weight along the edge i in C) \course\cpeg421-08s\topic-7.ppt 31

32 RecMII (Contn d) And the optimal period MII = T opt = 1/r opt Def: A cycle is critical if the period of the cycle equal to MII (We should write it as RecMII) Note: a loop may have multiple critical cycles! \course\cpeg421-08s\topic-7.ppt 32

33 An Example (Revisit) : RecMII for i = 0 to N - 1 do a: a[i] = a[i - 1] + R[i]; b: b[i] = a[i] + c[i - 1]; c: c[i] = b[i] + 1; end; a b II time time i = 0 i = 1 i = 2 0 a 1 b 2 c a 3 b 4 c a So, RecMII = 2 iterations i = 3... c Assume each operation takes 1 cycle and there is only one addition unit! \course\cpeg421-08s\topic-7.ppt 33

34 How About Machine Resource and Register Constraints? Numbers and types of FUs, etc, (ResMII) Number of registers \course\cpeg421-08s\topic-7.ppt 34

35 Software Pipelining Review of previous work [RauFisher93] [Rau94] Minimum initiation interval [MII] MII = max {RecMII, ResMII} where RecMII: determined by critical recurrence cycles in DDG ResMII: determined by resource constraints (#s and types of FUs) \course\cpeg421-08s\topic-7.ppt 35

36 Consider a simple example of n nodes and 1 FU -- what is ResMII? The Resource Constraint MII (ResMII) Can be calculated by totaling, for each resource, the usage requirement imposed by one iteration of the loop can be done by bin-packing a resource reservation table (expensive) usually derive a lower-bound is enough as a starting point to begin the search process More price with complex hardware pipelines? \course\cpeg421-08s\topic-7.ppt 36

37 Example 1: Reservation Table - An Example 1 Stage X X X X X 2 3 X (a) a Pipeline M (b) The Reservation Table of M \course\cpeg421-08s\topic-7.ppt 37

38 A Quiz? What is the RT of a fully pipelined adder with 3 stages? How about an unpipelined adder? How to computer ResMII for each case? \course\cpeg421-08s\topic-7.ppt 38

39 Modulo Reservation Table The modulo reservation table only has length = II Each entry in the table records a sequence of reservations corresponds a sequence of slots every II cycles Hints: think about your weekly calendar \course\cpeg421-08s\topic-7.ppt 39

40 Modulo Scheduling MII II = II + 1 Choose Next Operation X failed try to place x in a time slot i in II A legal schedule is found succeed Remove from L is L =<>? No \course\cpeg421-08s\topic-7.ppt 40

41 Heuristic Method for Modulo Scheduling Why a simple variant list scheduling may not work? Hint: consider the deadline constraints of operations in a cycle \course\cpeg421-08s\topic-7.ppt 41

42 Counter Example I: List Scheduling May Fail! (a) [1] A 4 C 2 B 2 D 4 (b) MII = RecMII = 4 Note: if simple list scheduling is used B cannot be scheduled due to the deadline set by scheduling C: deadlock! A B C D (c) MII = RecMII = 4 Note: we cannot fire C as early as possible! \course\cpeg421-08s\topic-7.ppt A B C D

43 Example I (Cont d) In previous figure, We show an example demonstrating the problem with greedy scheduling in the presence of recurrences. (a) The data dependence graph with a cycle. (b) The resulting partial schedule when C is scheduled greedily. B cannot be scheduled. (c) The resulting valid schedule when C is not scheduled greedyly delayed to two cycles later \course\cpeg421-08s\topic-7.ppt 43

44 Example 2: Problems with Greedy schedule due to ResMII A1 C2 A3 A4 M M Adder Mult Bus A1 M6 0 1 A4 C2 2 A Adder Mult Bus A1 M6 C2 A4 A3 M5 ResMII = 6 ResMII = 6 (a) DDG A: non-pipelined adders M: non-pipelined multipliers (b) A greedy schedule which (c) a non-greedy schedule cannot schedule A4 with which achieves ResMII ResMII \course\cpeg421-08s\topic-7.ppt 44

45 Example 2 (Cont d) In previous figure, We show an example demonstrating the problem with greedy scheduling in the presence of complex reservation tables. (a) The data dependence graph without cycle. (b) The resulting partial schedule when A1, M6, C2, and A3 are scheduled greedily. A4 cannot be scheduled. (c) The resulting valid schedule when A3 is scheduled one cycle later \course\cpeg421-08s\topic-7.ppt 45

46 Example 3: infeasibility of MII Con d A1 2 M2 3 MA3 5 Adder Mult Bus Adder Mult Bus Adder Mult Bus Note: ResMII = 2. But is there a legal schedule under II = 2? Note: The presence of complex reservation tables \course\cpeg421-08s\topic-7.ppt 46 (a)

47 Example 3 (cont d) Adder Mult Bus A1 M2 M2 Adder Mult Bus A1 M2 A1 MA3 A1 MA3 M2 MII = 2 II = 2 (b) cannot fit MA3 under II=2 MII = 2 II = 3 (c) A feasible schedule for II=3 Note: It is possible that there is no valid schedule at MII! \course\cpeg421-08s\topic-7.ppt 47

48 Infeasibility of MII Con d The previous slide shows an example demonstrating the infeasibility of the MII in the presence of complex reservation tables. (a) The three operations and their reservation tables. (b) The MRT corresponding to the dead-end partial schedule for an II of 2 after A1 and M2 have been scheduled. (c) The MRT corresponding to a valid schedule for an II of \course\cpeg421-08s\topic-7.ppt 48

49 Example 4: Infeasibility of MII Assume: fully pipelined adders A1 A2 2 2 A3 2 Adder Mult Bus 2 A3 2 [2] A4 2 2 A4 2 ResMII = 4 RecMII = 4 (a) Note: It is possible that there is no valid schedule at MII! And, the reservation table here is simple! \course\cpeg421-08s\topic-7.ppt 49 (b)

50 Example 4 (cont d) The previous slides shows an example demonstrating the infeasibility of the MII due to the interaction between the recurrence constraints and the resource usage constraints. (a) The data dependence graph. (b) The MRT corresponding to the dead-end partial schedule for an II of 4 after A1 and A2 have been scheduled \course\cpeg421-08s\topic-7.ppt 50

51 How to derive a best feasible schedule? It is possible do so via exhaustive search. But, it is expensive! \course\cpeg421-08s\topic-7.ppt 51

52 Software Pipelining A Taxonomy of Software Pipelining Operational Approach Periodic Scheduling (Modulo Scheduling) FSA Based Method (Model hardware pipeline with sharing and hazards) Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc.) Formal Model (GaoWonNin91) PLDI'91 In-Exact Method (Heuristic) (RauGla81, Lam88, RauEtAI92, Huff93, DehnertTow93, Rau94, WanEis93, StouchininEtAI00) Basic Formulation (DongenGao92) CONPAR'92 Register Optimal (Ninggao91, NingGao93, Ning93) POPL'93 Exact Method ILP based Resource Constrained (GovindAITGao94) Micro27 "Showdown" (RuttenbergGao StouchininWoody96) PLDI'96 Exhaustive Search EuroPar'96 FSA Co-Scheduling formulation (GovindAltmanGao'96) FSA Construction Method (GovindAltmanGao98) FSA Heuristic/optimization (ZhangGovindRyanGao99) Theory of Co-Scheduling (GovindAltmanGao00) Resource & Register (GovindAITGao95, Altman95, PLDI'95 EichenbergerDav95) (Altman95) \course\cpeg421-08s\topic-7.ppt 52

53 Advanced Topics Consider register constraints Realistic pipeline architecture constraints (with structural hazards) Loop body with conditionals Multi-Dimensional Loops \course\cpeg421-08s\topic-7.ppt 53

Announcements PA2 due Friday Midterm is Wednesday next week, in class, one week from today

Loop Transformations Announcements PA2 due Friday Midterm is Wednesday next week, in class, one week from today Today Recall stencil computations Intro to loop transformations Data dependencies between