CODE GENERATION: REGISTER ALLOCATION


CODE GENERATION

Goal: translate intermediate code into target code.
Interplay between:
- Register allocation
- Instruction selection
- Instruction scheduling

REGISTER ALLOCATION

REGISTER ALLOCATION: MOTIVATION
- Instructions involving register operands are usually more efficient than those involving operands in memory.
- The number of physical registers is limited.
- Moving data between registers and memory/cache is expensive.
The problem of deciding where to store values at each point in the code:
- Register allocation: deciding which values reside in registers.
- Register assignment: deciding which register to use for a particular value.
This distinction is often lost in the literature.

REGISTER ALLOCATION: GOAL
Effectively use the limited number of registers:
- Still need to produce correct code.
- Want to generate efficient code: registers are faster than memory, so minimize loads/stores from/to memory and minimize the space used to hold spilled values.
Complications (RISC vs. CISC):
- Register classes
- Special-purpose registers
- Operators with additional constraints on registers
- A certain number of registers may need to be reserved

EXAMPLE
Source:
  a = b + c
  t1 = a * a
  b = t1 + a
  c = t1 * b
Register-register addressing mode. Cost assumption: load/store = 2, add/mult = 1.

Naive code (cost = 26):          Allocated code (cost = 14):
  LOAD R1,b                        LOAD R1,b
  LOAD R2,c                        LOAD R2,c
  ADD R1,R1,R2                     ADD R1,R1,R2
  STORE R1,a                       STORE R1,a
  LOAD R1,a                        MUL R2,R1,R1
  MUL R1,R1,R1                     ADD R1,R1,R2
  STORE R1,t1                      MUL R2,R1,R2
  LOAD R1,t1                       STORE R1,b
  LOAD R2,a                        STORE R2,c
  ADD R1,R1,R2
  STORE R1,b
  LOAD R1,t1
  LOAD R2,b
  MUL R1,R1,R2
  STORE R1,c

HIGH-LEVEL PROCESS
Problem: at each instruction, decide which values to keep in registers.
- Simple if |values| <= |registers|; harder if |values| > |registers|.
- If there are not enough registers, decide which registers to spill to memory.
- Insert code to move values between registers and memory.
The compiler must automate this process.

COMPLEXITY
Can we do this optimally? (on real code?)

                      Simplified cases                   Real cases
  Local allocation    O(n) (single size, no spilling)    NP-complete (two sizes)
  Local assignment    O(n)                               NP-complete
  Global allocation   NP-complete for 1 register         NP-complete for k registers
  Global assignment   NP-complete

Local register allocation operates on basic blocks. Real compilers face real problems.

LOCAL REGISTER ALLOCATION
Register allocation within a single basic block. Assumptions that simplify the discussion:
- No registers inherited from predecessors.
- No values left in registers at the end of the basic block.
- Only a single register class: general-purpose registers.
Two approaches:
- Top-down allocator: work from externally derived information about what is important.
- Bottom-up allocator: work from synthesized knowledge about problem instances.

TOP-DOWN ALLOCATOR
General idea: keep heavily used values in registers.
Algorithm:
- Rank values by number of occurrences. First pass: tally the use counts for all virtual registers, e.g.

    priority(x) = sum over blocks B of [ use(x,B) + 2*live(x,B) ]

  where use(x,B) counts the number of times x is used in B, and live(x,B) = 1 only if x is live on exit from B and given a value in B. (A sketch of this ranking pass appears after this subsection.)
- Allocate the first k values to registers, assuming k is the number of available registers.
- Rewrite the code to reflect these choices. Second pass: stores/loads inserted.
A common technique of the 1960s and 70s. An allocated register is dedicated to a value for the entire basic block.

BOTTOM-UP ALLOCATOR
General idea: keep values that are used soon in registers.
Algorithm:
- Start with an empty register set.
- Load values on demand.
- When no register is available, free one: the focus is on replacement rather than allocation. Spill the value whose next use is farthest in the future.
Sound familiar? Think page replacement.
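A minimal sketch of the top-down ranking pass in Python. The three-address tuple encoding of blocks and the live_on_exit map are illustrative assumptions, not part of the slides:

```python
from collections import Counter

def topdown_priorities(blocks, live_on_exit):
    """Use-count ranking for the top-down allocator (sketch):
    priority(x) = sum over blocks B of use(x,B) + 2*live(x,B).
    blocks: {name: [(dst, src1, src2), ...]}; live_on_exit: {name: set}."""
    pri = Counter()
    for name, ops in blocks.items():
        defined = {dst for dst, _, _ in ops}
        for _, src1, src2 in ops:
            pri[src1] += 1                 # use(x, B)
            pri[src2] += 1
        for x in live_on_exit[name]:
            if x in defined:               # live on exit and defined in B
                pri[x] += 2                # 2 * live(x, B)
    return pri                             # top k values get registers
```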

BASIC BOTTOM-UP ALGORITHM
Local liveness information:
- A variable is live if it has a future use.
- A non-live value in a register can be discarded, freeing the register it currently occupies.
- All variables are stored to memory at the end of the block.
Data structures:
- Register descriptor: register status (empty, full) and contents (one or more "values").
- Address descriptor: the location (or locations) where the current value of a variable can be found (register, memory).

EXAMPLE
  a = b + c
  t1 = a * a
  b = t1 + a
  c = t1 * b
  t2 = c + b
  a = t2 + t2      live = {a,b,c}
All non-temporary variables are stored back to memory unless we can prove they are not live after the block.

EXAMPLE (computing liveness backward)
Liveness is computed from the bottom of the block upward. At each instruction, remove the LHS from the live set, since it is a definition, and add the RHS variables, since they are uses. For instance, starting from live = {a,b,c} after the block: processing a = t2 + t2 gives live = {t2,b,c} above it, and processing t2 = c + b then gives live = {b,c}. Continuing upward annotates the whole block, as shown next.

EXAMPLE (fully annotated)
  live = {b,c}
  a = b + c        live = {a}
  t1 = a * a       live = {a,t1}
  b = t1 + a       live = {b,t1}
  c = t1 * b       live = {b,c}
  t2 = c + b       live = {t2,b,c}
  a = t2 + t2      live = {a,b,c}

OPERATION x = y OP z
Assume a register-only addressing mode (i.e., RISC): each operation needs all its operands in registers before execution, and a register must be assigned for the target (result).
Iterate over the operations in the block:
- If y or z is already in a register, re-use that register.
- If y or z is not in a register, load it into a free register. If all registers are full, select the one whose value is used farthest in the future and spill it.
- If y or z is not live after this instruction, free its register.
- Allocate a register for x. Again, if all registers are full, select the one whose value is used farthest in the future and spill it.
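A minimal sketch of this bottom-up pass in Python. The instruction encoding, the precomputed next_use table, and the always-store spill policy are illustrative assumptions; the worked example that follows traces the same policy by hand:

```python
def bottom_up_allocate(block, k, next_use, live_out):
    """Bottom-up local allocation for one basic block (sketch).
    block:     list of (x, y, z) triples for x = y OP z
    k:         number of physical registers
    next_use:  (i, v) -> index of the next use of v at or after
               instruction i, or float('inf') if none; assumed
               precomputed (for i in 0..len(block)) by the backward pass
    live_out:  variables live on exit from the block"""
    reg_of, code = {}, []
    free = [f"R{n}" for n in range(k, 0, -1)]      # R1 is handed out first

    def take(i):
        # Grab a free register; if none, spill the value whose
        # next use is farthest in the future.
        if free:
            return free.pop()
        victim = max(reg_of, key=lambda v: next_use[(i, v)])
        r = reg_of.pop(victim)
        code.append(f"STORE {r},{victim}")          # conservative: always store
        return r

    def ensure(v, i):
        # Make sure operand v is in a register, loading on demand.
        if v not in reg_of:
            reg_of[v] = take(i)
            code.append(f"LOAD {reg_of[v]},{v}")
        return reg_of[v]

    for i, (x, y, z) in enumerate(block):
        ry, rz = ensure(y, i), ensure(z, i)
        for v in (y, z):                            # free dead operands
            if (v not in live_out and v in reg_of
                    and next_use[(i + 1, v)] == float('inf')):
                free.append(reg_of.pop(v))
        rx = take(i)                                # register for the result
        code.append(f"OP {rx},{ry},{rz}")
        reg_of[x] = rx
    for v in sorted(live_out):                      # block exit: store back
        if v in reg_of:
            code.append(f"STORE {reg_of[v]},{v}")
    return code
```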

EXAMPLE
  live = {b,c}
  1. a = b + c       live = {a}
  2. t1 = a * a      live = {a,t1}
  3. b = t1 + a      live = {b,t1}
  4. c = t1 * b      live = {b,c}
  5. t2 = c + b      live = {t2,b,c}
  6. a = t2 + t2     live = {a,b,c}

Initially: three registers, all empty: (R1,R2,R3) = (-,-,-). Current values: (a,b,c,t1,t2) = (m,m,m,-,-), where m means "in memory".

Instruction 1: a = b + c. Live = {a}: no need to keep b or c in a register.
  LOAD R1,b
  LOAD R2,c
  ADD R1,R1,R2     // R1 = R1 + R2
Registers: (R1,R2,R3) = (a,-,-). Current values: (a,b,c,t1,t2) = (R1,m,m,-,-).

EXAMPLE (continued)
Instruction 2: t1 = a * a. Live = {a,t1}. [Registers: (R1,R2,R3) = (a,-,-)]
  MUL R2,R1,R1     // R2 = R1 * R1
Registers: (R1,R2,R3) = (a,t1,-). Current values: (a,b,c,t1,t2) = (R1,m,m,R2,-).

Instruction 3: b = t1 + a. Live = {b,t1}. The register allocated to b is R1, since a is not live after this instruction.
  ADD R1,R1,R2     // R1 = R1 + R2
Registers: (R1,R2,R3) = (b,t1,-). Current values: (a,b,c,t1,t2) = (-,R1,m,R2,-).

Instruction 4: c = t1 * b. Live = {b,c}. [Registers: (R1,R2,R3) = (b,t1,-)] The register allocated to c is R2, since t1 is not live after this instruction.
  MUL R2,R1,R2     // R2 = R1 * R2
Registers: (R1,R2,R3) = (b,c,-). Current values: (a,b,c,t1,t2) = (-,R1,R2,-,-).

Instruction 5: t2 = c + b. Live = {b,c,t2}. The register allocated to t2 is R3.
  ADD R3,R1,R2     // R3 = R1 + R2
Registers: (R1,R2,R3) = (b,c,t2). Current values: (a,b,c,t1,t2) = (-,R1,R2,-,R3).

EXAMPLE (continued)
Instruction 6: a = t2 + t2. Live = {a,b,c}. [Registers: (R1,R2,R3) = (b,c,t2)]
  ADD R3,R3,R3
Registers: (R1,R2,R3) = (b,c,a). Current values: (a,b,c,t1,t2) = (R3,R1,R2,-,-).
Since this is the end of the block, move the live variables to memory:
  MOV R3,a
  MOV R1,b
  MOV R2,c
Registers: (R1,R2,R3) = (-,-,-). Current values: (a,b,c,t1,t2) = (m,m,m,-,-).

EXAMPLE SUMMARY

  SOURCE          LIVE        TARGET           REGISTERS (R1,R2,R3)
  a = b + c       {a}         LOAD R1,b
                              LOAD R2,c
                              ADD R1,R1,R2     (a,-,-)
  t1 = a * a      {a,t1}      MUL R2,R1,R1     (a,t1,-)
  b = t1 + a      {b,t1}      ADD R1,R1,R2     (b,t1,-)
  c = t1 * b      {b,c}       MUL R2,R1,R2     (b,c,-)
  t2 = c + b      {b,c,t2}    ADD R3,R1,R2     (b,c,t2)
  a = t2 + t2     {a,b,c}     ADD R3,R3,R3     (b,c,a)
                              MOV R3,a
                              MOV R1,b
                              MOV R2,c         (-,-,-)

EXAMPLE WITH SPILLING
For the same sequence, suppose we have only 2 physical registers. After instruction 4 (see above):
Registers: (R1,R2) = (b,c). Current values: (a,b,c,t1,t2) = (-,R1,R2,-,-).
Instruction 5: t2 = c + b. Live = {b,c,t2}. Distance to the next use: dist(b) = dist(c) = infinity (no further use in this block), so pick either to spill.
  MOV R1,b         // SPILL!
  ADD R1,R1,R2
Registers: (R1,R2) = (t2,c). Current values: (a,b,c,t1,t2) = (-,m,R2,-,R1).

LIVE EXAMPLE (2 registers)

  SOURCE          LIVE        TARGET           REGISTERS (R1,R2)
  a = b + c       {a}         LOAD R1,b
                              LOAD R2,c
                              ADD R1,R1,R2     (a,-)
  t1 = a * a      {a,t1}      MUL R2,R1,R1     (a,t1)
  b = t1 + a      {b,t1}      ADD R1,R2,R1     (b,t1)
  c = t1 * b      {b,c}       MUL R2,R1,R2     (b,c)
  t2 = c + b      {b,c,t2}    MOV R1,b         // spill
                              ADD R1,R1,R2     (t2,c)
  a = t2 + t2     {a,b,c}     ADD R1,R1,R1     (a,c)
                              MOV R1,a
                              MOV R2,c         (-,-)

SPILLING CONSIDERATIONS
Goal: minimize the spilling cost. Heuristics:
1. Spill the value that is used farthest in the future.
2. Spill clean values instead of dirty values. A value that does not change, and hence need not be stored on a spill, is called clean; otherwise it is dirty.
Taking the distinction between clean and dirty values into account makes local allocation NP-hard, and there is no guarantee which heuristic will produce a better allocation for all cases.

EXAMPLE
Initial: 2 registers holding (x1,x2); x1 clean, x2 dirty. Reference string 1: x3 x1 x2 (x3 clean).

  Spill farthest (evict x2):         Spill clean (evict x1):
    Store x2; Load x3    (x1,x3)       Load x3    (x3,x2)
    (x1 hits)                          Load x1    (x1,x2)
    Load x2              (x1,x2)       (x2 hits)
  3 memory operations                2 memory operations

Here spilling the clean value wins.

EXAMPLE
Initial: 2 registers holding (x1,x2); x1 clean, x2 dirty. Reference string 2: x3 x1 x3 x1 x2 (x3 clean).

  Spill farthest (evict x2):         Spill clean (evict x1):
    Store x2; Load x3    (x1,x3)       Load x3    (x3,x2)
    (x1, x3, x1 all hit)               Load x1    (x1,x2)
    Load x2              (x1,x2)       Load x3    (x3,x2)
                                       Load x1    (x1,x2)
                                       (x2 hits)
  3 memory operations                4 memory operations

Here spilling the farthest value wins: neither heuristic dominates.

REGISTER ALLOCATION ACROSS BLOCKS

  ... store r4 -> x      // end of one block
  load x -> r1 ...       // start of the next block

Given global liveness, this is an assignment problem, not an allocation problem! Local register allocation assumes that variables are stored/loaded to/from memory at block boundaries.
- Could replace the load with a move; a good assignment would obviate the move.
What is harder across multiple blocks? We must build a control-flow graph to understand interblock flow and assign registers in a consistent way.

REGISTER ALLOCATION
A more complex scenario: a block with multiple predecessors in the control-flow graph.

  pred 1: ... store r4 -> x
  pred 2: ... store r3 -> x
  successor: load x -> r1 ...

What if one predecessor has x in a register, but the other does not? We must get the right values into the right registers from either predecessor.
Other issues:
- How to determine the execution frequency of an instruction?
- How to define "farthest next reference" across blocks?

REGISTER ALLOCATION VIA GRAPH COLORING
Graph coloring paradigm (local, superlocal (EBB), global):
1. Build an interference graph (requires computing global LIVE information).
2. Construct a k-coloring, where k is the number of available physical registers; spill some variables if a k-coloring cannot be constructed. (Minimal coloring is NP-complete.)
3. Map colors onto physical registers.

LIVE RANGES
A single live range contains a set of definitions and uses. The concept relies on the notion of liveness: variable x is live at point p if it has been defined and there is a path leading from p to a use of x.
- A live range should start with a definition and end with the last use of that definition.
- A specific variable may have many distinct live ranges.
- Register allocation based on live ranges can place distinct live ranges in different registers: a variable can be stored in different registers at distinct points in the program execution.

LIVE RANGES EXAMPLE
  live = {b,c}
  1: a = b + c       live = {a}
  2: t1 = a * a      live = {a,t1}
  3: b = t1 + a      live = {b,t1}
  4: c = t1 * b      live = {b,c}
  5: t2 = c + b      live = {t2,b,c}
  6: a = t2 + t2     live = {a,b,c}

Live ranges:
  a:  [1,3], [6,exit]
  b:  [entry,1], [3,exit]
  c:  [entry,1], [4,exit]
  t1: [2,4]
  t2: [5,6]

In this sequence, each definition introduces a new live range: sounds familiar? Live variables that need to be propagated to the next basic block form LIVEOUT.

DISCOVERING GLOBAL LIVE RANGES
SSA provides a natural starting point:
- Each definition gets assigned a new live range.
- Variables live across multiple blocks: a single live range covers the definition and its uses.
- A use that might be reached by multiple definitions: phi-functions. A single live range is assigned to both the parameters and the target of a phi-function:

    x0 = ...                      rx = ...
    x1 = ...            ==>       rx = ...
    x2 = phi(x0, x1)
    ... = x2                      ... = rx

INTERFERENCE
Definition: two values interfere if at some point in the program both are simultaneously live (overlapping live ranges). If x and y interfere, they cannot occupy the same register.
Construct the interference graph GI = <N,E>:
- Nodes N in GI represent live ranges.
- Edges E in GI represent interferences between live ranges: for x, y in N, <x,y> is in E iff x and y interfere.
A k-coloring of GI can be mapped into an allocation to k registers.

BUILDING INTERFERENCE GRAPHS
Algorithm to construct GI = <N,E>:

  for each live range LRi, create a node ni in N
  for each basic block b
    LIVENOW = LIVEOUT(b)
    for each operation oi in b, in reverse order, of the form oi: LRc = LRa op LRb
      for each LRj in LIVENOW
        add (LRc, LRj) to E          // interference
      remove LRc from LIVENOW
      add LRa and LRb to LIVENOW

INTERFERENCE GRAPH EXAMPLE
  live = {b,c}
  a1 = b1 + c1       live = {a}
  t1 = a1 * a1       live = {a,t1}
  b2 = t1 + a1       live = {b,t1}
  c2 = t1 * b2       live = {b,c}
  t2 = c2 + b2       live = {b,c,t2}
  a2 = t2 + t2       live = {a,b,c}
(Figure: the interference graph over the live ranges a1, b1, c1, b2, c2, t1, t2, a2.)
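A direct transcription of this algorithm into Python, as a sketch; the tuple encoding of operations and the precomputed LIVEOUT sets are assumptions:

```python
def build_interference_graph(blocks, live_out):
    """Build interference edges per the algorithm above (sketch).
    blocks:   {name: [(LRc, LRa, LRb), ...]}  three-address operations
    live_out: {name: set of live ranges in LIVEOUT(block)}  -- assumed
              precomputed by global liveness analysis"""
    nodes, edges = set(), set()
    for name, ops in blocks.items():
        livenow = set(live_out[name])
        for lr_c, lr_a, lr_b in reversed(ops):
            nodes.update((lr_c, lr_a, lr_b))
            for lr_j in livenow:
                if lr_j != lr_c:
                    edges.add(frozenset((lr_c, lr_j)))   # interference
            livenow.discard(lr_c)
            livenow.update((lr_a, lr_b))
    return nodes, edges
```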

GRAPH COLORING PROBLEM
Problem: a graph G is said to be k-colorable iff its nodes can be labeled with the integers 1..k so that no edge in G connects two nodes with the same label. Determining whether a graph is k-colorable is NP-complete for k > 2.
(Figure: a 4-colorable example graph; with k = 3 there is no color for one node, with k = 4 the coloring succeeds.)

CHAITIN'S ALGORITHM
Bottom-up coloring (in EAC) for k registers:
1. Compute liveness information.
2. Create the interference graph G.
3. Simplify: pick any node n with fewer than k neighbors; remove it, along with all edges incident to it, from the graph and push it onto a stack. This lowers the degree of n's neighbors. If (G - n) can be colored with k colors, so can G. If we reduce the entire graph, go to step 5; otherwise repeat step 3.

Chaitin, "Register Allocation and Spilling via Graph Coloring", SIGPLAN Symposium on Compiler Construction, June 1982.

CHAITIN'S ALGORITHM (continued)
4. Spill: after step 3, if the graph is not empty, we have reached a point where only nodes with degree >= k remain. Mark some node for potential spilling, remove it, and go back to step 3.
5. Assign colors: starting with the empty graph, rebuild it by popping elements off the stack and assigning each a color different from its neighbors. Potential spill nodes may or may not be colorable. (A code sketch of the simplify/select loop follows; the worked example comes after it.)
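A minimal sketch of the simplify/select phases in Python. It omits spill-code insertion and graph rebuilding, and it already pushes spill candidates optimistically (per Briggs et al., discussed later); the highest-degree spill choice mirrors the example below:

```python
def chaitin_color(nodes, edges, k):
    """Simplify/select phases of Chaitin-style coloring (sketch).
    edges: iterable of frozenset pairs, as built above.
    Returns (color, spilled): color maps colored nodes to 0..k-1."""
    adj = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    remaining = {n: set(adj[n]) for n in nodes}
    stack = []
    while remaining:
        # Simplify: remove any node with degree < k.
        n = next((x for x in remaining if len(remaining[x]) < k), None)
        if n is None:
            # Spill: mark a potential spill node (here: highest degree).
            n = max(remaining, key=lambda x: len(remaining[x]))
        stack.append(n)
        for m in remaining[n]:
            remaining[m].discard(n)       # lower the neighbors' degrees
        del remaining[n]

    # Select: pop and rebuild, giving each node a color its
    # already-colored neighbors do not use.
    color, spilled = {}, set()
    while stack:
        n = stack.pop()
        used = {color[m] for m in adj[n] if m in color}
        free = [c for c in range(k) if c not in used]
        if free:
            color[n] = free[0]
        else:
            spilled.add(n)                # potential spill was uncolorable
    return color, spilled
```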

EXAMPLE
Interference graph from the previous example; assume k = 3. Simplify: repeatedly pick a node with fewer than 3 neighbors, remove it, and push it onto the stack:

  pick a1   ->  stack: a1
  pick t1   ->  stack: a1 t1
  pick a2   ->  stack: a1 t1 a2
  pick b2   ->  stack: a1 t1 a2 b2
  pick c2   ->  stack: a1 t1 a2 b2 c2
  pick t2   ->  stack: a1 t1 a2 b2 c2 t2    (empty graph!)

Select (colors 1: black, 2: red, 3: blue): pop and rebuild, assigning each node a color different from its already-colored neighbors, in the order t2, c2, b2, a2, t1, a1.

EXAMPLE (coloring mapped onto registers)

  a1 = b1 + c1        R1 = R1 + R2
  t1 = a1 * a1        R2 = R1 * R1
  b2 = t1 + a1        R3 = R2 + R1
  c2 = t1 * b2        R2 = R2 * R3
  t2 = c2 + b2        R1 = R2 + R3
  a2 = t2 + t2        R1 = R1 + R1

EXAMPLE 2
Nodes a, b, d, e, f, t; k = 4: only four physical registers are available.
Can't simplify: no vertex has fewer than 4 neighbors. Choose t as a potential spill: highest degree.

EXAMPLE 2 (continued)
After removing t, f and b have degree < 4. Push f (stack: f); now all remaining nodes have degree < 4.

EXAMPLE 2 (continued)
Continue simplifying until the graph is empty; stack: f e d b a (plus the potential spill t). Pop nodes from the stack one by one and assign colors. Value t is not colorable: it is spilled.

SPILLING
Estimating spill costs: address computation, memory operations, estimated execution frequency.
Spill decisions can be based on different spill metrics:
- Degree in the interference graph
- Spill cost / execution frequency
- Number of spill operations
- A combination of the above; experiment with different heuristics
Spill implementation: first check whether the node is colorable; if not, insert a STORE after the definition and a LOAD before the use.

COLORING SPILLED NODES
Optimistic coloring (Briggs et al.), an improvement over Chaitin's original algorithm: a spilled node may still be colorable.
- Also push spilled nodes onto the stack, according to some priority.
- When popping nodes off for coloring, nodes that turn out to be un-colorable are kept in memory.
(Figure: with 2 registers, a graph in which every node has degree >= 2 can still be 2-colorable, so the degree test alone is pessimistic.)

OTHER SPILLING STRATEGIES
- Clean spilling: spill a value once per block, if possible; avoids redundant loads and stores.
- Best-of-three spilling (Bernstein et al.): Simplify/Select is cheap relative to Build/Coalesce, so try it with several different heuristics and keep the best result.
- Rematerialization (Briggs/Cooper): recognize values that are cheaper to recreate; rather than spill them, rematerialize them.

EXAMPLE 3: 4 COLORS
{k,j} live in:
  g = mem[j+12]
  h = k - 1
  f = g * h
  e = mem[j+8]
  m = mem[j+15]
  b = mem[f]
  c = e + 8
  d = c
  k = m + 4
  j = b
{d,k,j} live out.
(Figure: the interference graph over j, h, f, k, g, d, e, b, c, m.)

EXAMPLE 3: 4 COLORS (after allocation)
{k,j} live in:
  R4 = mem[R1+12]
  R3 = R2 - 1
  R3 = R4 * R3
  R4 = mem[R1+8]
  R2 = mem[R1+15]
  R3 = mem[R3]
  R1 = R4 + 8
  R4 = R1
  R2 = R2 + 4
  R1 = R3
{d,k,j} live out.
Color classes: R1 (black): j, c; R2 (red): k, m; R3 (blue): b, h, f; R4 (green): e, d, g.

REGISTER COALESCING
Eliminate redundant register moves by merging live ranges. Register coalescing combines non-interfering nodes in the graph that are connected by a copy statement, yielding a new node with the union of the edges of the two previous nodes.
Benefits of coalescing ni and nj:
- Copies are eliminated.
- Reduces the degree of any node that interferes with both ni and nj.
- Shrinks the set of live ranges.
Interleave the search for coalescing opportunities with the simplify phase.

EXAMPLE 3: REGISTER COALESCING
{k,j} live in:
  g = j + 12
  h = k - 1
  f = g * h
  e = j + 8
  m = j + 15
  b = f + 2
  c = e + 8
  d = c         // copy
  k = m + 4
  j = b         // copy
{d,k,j} live out.
(Figure: coalescing the copy d = c merges nodes c and d into a single node c/d.)
Effect: the degrees of nodes b and m are reduced, and k does not really interfere with c.

REGISTER COALESCING (continued)
Degree changes: coalescing may make a colorable graph uncolorable, because merging adds additional constraints to the coloring.
- Conservative coalescing: combine Rx and Ry to form Rxy only if Rxy has fewer than k neighbors of degree >= k (Briggs et al.). A sketch of this test follows.
Imprecision: interference not included originally may be introduced; rebuild the interference graph.
The order of coalescing matters: coalescing two live ranges may prevent subsequent coalescing of other live ranges. In principle, coalesce the most frequently executed copies first.
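The Briggs test translates into a few lines; a sketch, assuming an adjacency-set representation of the interference graph:

```python
def briggs_coalesce_ok(adj, x, y, k):
    """Conservative (Briggs) coalescing test (sketch): merging x and y
    is safe if the merged node has fewer than k neighbors of degree >= k.
    adj: live range -> set of interfering live ranges."""
    merged = (adj[x] | adj[y]) - {x, y}
    def degree_after(n):
        # A node adjacent to both x and y loses one edge in the merge.
        return len(adj[n]) - (1 if x in adj[n] and y in adj[n] else 0)
    return sum(1 for n in merged if degree_after(n) >= k) < k
```

Such a merged node is guaranteed to simplify, so coalescing cannot turn a colorable graph uncolorable.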

INSTRUCTION SELECTION

INSTRUCTION SELECTION
The problem of deciding the set of instructions included in the generated code. Straightforward if there are no efficiency concerns, e.g. for a = a + 1:

  MOV a, R0
  ADD #1, R0       vs.     INC a
  MOV R0, a

Issues:
- Multiple choices in a rich instruction set
- Automatic selection
- Different cost considerations (speed, power, space)
- Cost may be influenced by the surrounding context
- The ISA may have additional constraints

EXAMPLE ASSEMBLY CODE
SPIM language: RISC, register-register. Operand types: num, register, label, memory.
Arithmetic operations: add r1, r2, r3 (likewise sub, mul, div); addi r1, r2, c.
Addressing modes (load/store):

  Format                   Address computation
  (reg)                    contents of register
  imm                      immediate
  imm(reg)                 contents of register + imm
  symbol                   address of symbol
  symbol +/- imm           address of symbol +/- imm
  symbol +/- imm(reg)      address of symbol +/- (contents of register + imm)

EXAMPLE ASSEMBLY CODE (continued)
SPIM language:
- Move: move r1,r2
- Shifts/rotates: rol, ror, sll, srl
- Unsigned operations, floating-point operations, ...

VARIABLE ADDRESSING
Variables are typically referenced as offsets from some known location (a base):
- Stack pointer
- ARP: activation record pointer
- Local or global data space
E.g., IDENT <a,ARP,4> or IDENT <c,@G,4>: name, base address, offset.

EXAMPLE
Tree for a x 2:

        *
       / \
  IDENT <a,ARP,4>   NUMBER <2>

  Tree-walk code:        Desired code:
    lw  $t0,4($sp)         lw  $t0,4($sp)
    li  $t1,2              mul $t0,$t0,2
    mul $t0,$t0,$t1

We must combine the information included in these two nodes: this is a non-local problem.

THE BIG PICTURE
It is not easy to consider context information in the simple tree-walk approach; we need pattern-matching techniques. Desired properties:
- Efficient instruction selection (generate code quickly)
- Generated code of good quality, for some metric of "good"
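A minimal tree-walk generator in Python showing why the naive code above appears; the tuple node encoding is an assumption for illustration:

```python
def gen(node, regs, code):
    """Naive tree-walk code generation (sketch).
    node: ("IDENT", name, base, offset) | ("NUMBER", value) | ("MUL", l, r)
    regs: free register names; code: output list of SPIM-like strings."""
    kind = node[0]
    if kind == "IDENT":
        r = regs.pop()
        code.append(f"lw {r},{node[3]}($sp)")   # assumes the base is $sp
        return r
    if kind == "NUMBER":
        r = regs.pop()
        code.append(f"li {r},{node[1]}")        # the constant gets a register
        return r
    left = gen(node[1], regs, code)
    right = gen(node[2], regs, code)
    code.append(f"mul {left},{left},{right}")
    regs.append(right)                          # right operand's reg is free
    return left

code = []
gen(("MUL", ("IDENT", "a", "ARP", 4), ("NUMBER", 2)), ["$t1", "$t0"], code)
print(code)   # ['lw $t0,4($sp)', 'li $t1,2', 'mul $t0,$t0,$t1']
```

Because each node is translated in isolation, the constant 2 lands in its own register; emitting the desired mul $t0,$t0,2 requires looking at both nodes at once.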

PATTERN-MATCHING SCHEMES
General idea: pattern matching.
- A tree-oriented IR suggests pattern matching on trees: tree patterns as input, each pattern mapping to a target-machine instruction sequence.
- A linear IR suggests some sort of string matching: strings as input, each string mapping to a target-machine instruction sequence; use text matching or peephole matching.
In practice both work well, though the matchers are quite different. Today: peephole matching for instruction selection applied to a linear IR (see Dragon 8.9 and EAC Ch. 11.3 for tree-pattern matching).

PEEPHOLE MATCHING
Basic idea, inspired by peephole optimization: the compiler can discover improvements locally.
- Look at a small set of adjacent operations.
- Move a peephole over the code and search for improvements.
Classic examples: store followed by load, simple algebraic identities, branch to branch.

  Original code:          Improved code:
    sw $t0,8($sp)           sw $t0,8($sp)
    lw $t2,8($sp)           move $t2,$t0

PEEPHOLE MATCHING EXAMPLES
Simple algebraic identities:

  Original code:          Improved code:
    addi $t3,$t2,0          mul $t4,$t2,$t1
    mul  $t4,$t3,$t1

Branch to branch:

  Original code:          Improved code:
    b L10                   b L11
    ...                     ...
    L10: b L11              L10: b L11

PEEPHOLE MATCHING
Early systems used a limited set of hand-coded patterns; a small window size (2-3 operations) ensured quick processing. Modern peephole instruction selectors: increasingly complex ISAs led to a systematic approach. Break the problem into three tasks:

  IR -> Expander -> LLIR -> Simplifier -> LLIR -> Matcher -> ASM

EXPANDER
Turns IR code into a low-level IR (LLIR):
- Operation-by-operation, template-driven rewriting.
- The LLIR form includes all direct effects.
- Significant, albeit constant, expansion of code size.

SIMPLIFIER
Looks at the LLIR through a window and rewrites it:
- Performs local optimization within the window.
- Uses forward substitution, algebraic simplification, local constant propagation/folding, and dead-effect elimination.
This is the heart of the peephole system; the benefit of peephole optimization shows up in this step.
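The forward-substitution step at the heart of the simplifier, as a minimal Python sketch; the tuple instruction encoding and the naive deadness test stand in for the simplifier's real dead-effect information:

```python
def simplify_window(ops):
    """One simplifier rule as a sketch: forward-substitute a register
    copy ('copy', dst, src) into the following operation and drop the
    copy when dst is dead afterwards. Instructions are tuples of
    (opcode, result, operand, ...)."""
    out, i = [], 0
    while i < len(ops):
        op = ops[i]
        if op[0] == "copy" and i + 1 < len(ops):
            dst, src = op[1], op[2]
            nxt = ops[i + 1]
            dead_later = not any(dst in o[2:] for o in ops[i + 2:])
            if dst in nxt[2:] and dead_later:
                # Rewrite the next op's operands and drop the copy.
                out.append((nxt[0], nxt[1]) +
                           tuple(src if a == dst else a for a in nxt[2:]))
                i += 2
                continue
        out.append(op)
        i += 1
    return out

# ('copy', 'r12', 'r0'), ('load', 'r13', 'r12')  ->  ('load', 'r13', 'r0')
```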

MATCHER
Compares the simplified LLIR against a library of patterns:
- Picks the low-cost pattern that captures the effects; it must preserve the LLIR's effects, though it may add new ones.
- Generates the assembly code output.

EXAMPLE
Original IR code (computing w = x - 2*y):

  OP    Arg1   Arg2   Result
  mul   2      y      t1
  sub   x      t1     w

Expand to LLIR code:

  r10 <- 2
  r11 <- @y
  r12 <- r0 + r11
  r13 <- MEM(r12)
  r14 <- r10 x r13
  r15 <- @x
  r16 <- r0 + r15
  r17 <- MEM(r16)
  r18 <- r17 - r14
  r19 <- @w
  r20 <- r0 + r19
  MEM(r20) <- r18

Here @x, @y, @w are the offsets of x, y, w from the activation record pointer r0. Think of the ri as temporaries, of which we have an infinite number.

SIMPLIFIER (3-OPERATION WINDOW)
Slide a three-operation window over the LLIR, applying forward substitution and dead-code elimination at each step (assuming we know each substituted register is dead after the substitution):

  r10 <- 2; r11 <- @y; r12 <- r0 + r11        =>  r10 <- 2; r12 <- r0 + @y        (fold r11, now dead)
  r10 <- 2; r12 <- r0 + @y; r13 <- MEM(r12)   =>  r10 <- 2; r13 <- MEM(r0 + @y)   (fold r12)
  r10 <- 2; r13 <- MEM(r0 + @y); r14 <- r10 x r13
                                              =>  r13 <- MEM(r0 + @y); r14 <- 2 x r13   (fold r10)
  r13 <- MEM(r0 + @y) rolls out of the window
  r14 <- 2 x r13; r15 <- @x; r16 <- r0 + r15  =>  r14 <- 2 x r13; r16 <- r0 + @x  (fold r15)
  r14 <- 2 x r13; r16 <- r0 + @x; r17 <- MEM(r16)
                                              =>  r14 <- 2 x r13; r17 <- MEM(r0 + @x)   (fold r16)
  r14 and r17 roll out; r18 <- r17 - r14 admits no simplification
  r18 <- r17 - r14; r19 <- @w; r20 <- r0 + r19 => r18 <- r17 - r14; r20 <- r0 + @w (fold r19)
  r18 <- r17 - r14; r20 <- r0 + @w; MEM(r20) <- r18
                                              =>  r18 <- r17 - r14; MEM(r0 + @w) <- r18 (fold r20)

Simplified LLIR code:

  r13 <- MEM(r0 + @y)
  r14 <- 2 x r13
  r17 <- MEM(r0 + @x)
  r18 <- r17 - r14
  MEM(r0 + @w) <- r18

5 operations instead of 12; 4 temporaries instead of 11.

EXAMPLE
Simplified LLIR code:

  r13 <- MEM(r0 + @y)
  r14 <- 2 x r13
  r17 <- MEM(r0 + @x)
  r18 <- r17 - r14
  MEM(r0 + @w) <- r18

Match to SPIM code:

  lw  $t0,@y($sp)
  mul $t1,$t0,2
  lw  $t2,@x($sp)
  sub $t3,$t2,$t1
  sw  $t3,@w($sp)

The matcher introduced all memory operations and temporary names; the result turned out to be pretty good code.

OTHER CONSIDERATIONS
- Dead values: a list of dead values can be constructed during the expansion step.
- Control-flow operations. Easy approach: clear the window when a control-flow operation or a label is reached. Difficult approach: examine the context around branches.
- Logical windows: the physical window is small, which is quick but misses some opportunities. Alternatively, the compiler can consider definitions and uses of the same value together, as if they were in a logical window.

INSTRUCTION SCHEDULING

INSTRUCTION SCHEDULING
Problem: given a fixed set of instructions, decide their execution order. Primary purpose: minimize execution time. Globally picking the best order is NP-complete.
Possible impacts of changing the computation order:
- Hide latency
- Improve hardware utilization (superscalar and VLIW)
- Reduce register pressure
Restrictions:
- Data dependences / control dependences
- Resource constraints

VLIW PROCESSORS
VLIW (Very Long Instruction Word):
- Can issue more than one pipelined instruction per cycle.
- A wide instruction holds several normal instructions, all of which are issued at the same time; typically each corresponds to an operation on a different functional unit.
- Scheduling is more complicated: compilers are expected to pack the wide instruction correctly and as efficiently as possible.

WHY DOES IT MATTER?
Many operations have non-zero latencies, so execution time is order-dependent. Assumed latencies (conservative):

  Operation   Cycles
  load        3
  store       3
  loadi       1
  add         1
  mult        2
  fadd        1
  fmult       2
  shift       1
  branch      0 to 8

Loads and stores may or may not block; if non-blocking, other operations can fill those issue slots. The scheduler should hide the latencies.

WHY DOES IT MATTER?
Computing w = w * 2 * x * y * z; the first column is the issue time.

  Schedule 1 (done at 21):   Schedule 2 (done at 18):   Schedule 3 (done at 14):
   1 lw   $t0,w               1 lw   $t0,w               1 lw   $t0,w
   4 add  $t0,$t0,$t0         2 lw   $t1,x               2 lw   $t1,x
   5 lw   $t1,x               4 add  $t0,$t0,$t0         3 lw   $t2,y
   8 mult $t0,$t0,$t1         5 mult $t0,$t0,$t1         4 add  $t0,$t0,$t0
   9 lw   $t1,y               6 lw   $t1,y               5 mult $t0,$t0,$t1
  12 mult $t0,$t0,$t1         9 mult $t0,$t0,$t1         6 lw   $t1,z
  13 lw   $t1,z              10 lw   $t1,z               7 mult $t0,$t0,$t2
  16 mult $t0,$t0,$t1        13 mult $t0,$t0,$t1         9 mult $t0,$t0,$t1
  18 sw   $t0,w              15 sw   $t0,w              11 sw   $t0,w

Schedule 3 requires an extra register.

SCHEDULING CONSTRAINTS
- Hardware resources: only a limited number of operations that rely on a particular type of hardware unit can be issued in each cycle.
- Control dependence: the execution of one statement depends on the outcome of another (e.g., a branch).
- Data dependence: constraints that ensure data is produced and consumed in the correct order. There is a data dependence from statement S1 to statement S2 (S2 depends on S1) if and only if both statements access the same memory location, at least one of them stores into it, and there is a feasible run-time execution path from S1 to S2.

PIPELINE QUICK REVIEW
Five stages (IF ID EX MA WB):
- IF: instruction fetch
- ID: instruction decode and register fetch
- EX: execution and effective address calculation
- MA: memory access
- WB: write back
Multiple instructions are overlapped in execution, at different stages: instruction-level parallelism (ILP). Pipeline bubble: the whole pipeline stalls due to structural/data/control hazards.

CONTROL HAZARDS
Branches often take some number of cycles to complete, creating delay slots: the hardware must determine taken or not, and compute the target address if taken. A compiler will try to fill these delay slots with valid and useful instructions (rather than nops); it may need to examine the context around branches and move operations.
(Pipeline diagram: if the branch is taken, the following instruction stalls in IF until the branch resolves.)

BRANCH SCHEDULING EXAMPLES (op dest, src1, src2)
Filling the delay slot from before the branch:

  Add R1, R2, R3              if R2 = 0 then
  if R2 = 0 then        =>      Add R1, R2, R3    (delay slot)
  <delay slot>

Filling the delay slot from the target:

  Sub R4, R5, R6              Add R1, R2, R3
  ...                   =>    if R1 = 0 then
  Add R1, R2, R3                Sub R4, R5, R6    (delay slot)
  if R1 = 0 then
  <delay slot>

STALL CYCLES
Some architectures require some number of cycles between a condition-setting instruction and the branch that uses the condition. Ex: SPARC requires at least one non-floating-point instruction between a floating-point compare and the branch instruction that uses the result.

DATA HAZARD EXAMPLE (op dst, src1, src2)
Data hazard due to memory latency:

  lw  R1,0(R2)    IF ID EX MA WB
  add R3,R1,R4       IF ID stall EX MA WB     (waiting for R1)
  lw  R5,8(R2)          IF stall ID EX MA WB

Hazard removed by scheduling:

  lw  R1,0(R2)    IF ID EX MA WB
  lw  R5,8(R2)       IF ID EX MA WB
  add R3,R1,R4          IF ID EX MA WB

DATA DEPENDENCIES
- Flow dependences (read after write): x = 4; y = x + 1
- Antidependences (write after read): y = x + 1; x = 4
- Output dependences (write after write): x = 4; x = y + 1
The latter two can be eliminated by renaming, since no value flows along them. Flow dependences are also called true dependences.
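The three kinds follow mechanically from each statement's read and write sets; a small Python sketch with an assumed (writes, reads) encoding:

```python
def classify_dependence(s1, s2):
    """Classify the data dependences from s1 to s2 (executed later).
    Each statement is (writes, reads): sets of accessed locations."""
    w1, r1 = s1
    w2, r2 = s2
    kinds = []
    if w1 & r2: kinds.append("flow (read after write)")
    if r1 & w2: kinds.append("anti (write after read)")
    if w1 & w2: kinds.append("output (write after write)")
    return kinds

# x = 4; y = x + 1  ->  ['flow (read after write)']
print(classify_dependence(({"x"}, set()), ({"y"}, {"x"})))
```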

EXAMPLE
  x = 4
  y = 6
  p = x + 2
  z = y + p
  x = z
  y = p
(Figure: flow dependences such as x = 4 -> p = x + 2; output dependences such as x = 4 -> x = z; antidependences such as p = x + 2 -> x = z.)

DEPENDENCE GRAPH
To capture the scheduling constraints of the code, build a dependence/precedence graph G = <N,E>:
- A node n in N is an operation, with type(n) and delay(n).
- A directed edge e = (n1,n2) is in E if and only if n2 uses the result of n1. Edge (n1,n2) can be decorated with delay(n1).

Instruction latency/delay: LD, ST: 3; ADD, SUB: 1; MUL, DIV: 2.

  a) LD  w      -> R1
  b) LD  y      -> R2
  c) DIV R2,R1  -> R1
  d) LD  z      -> R2
  e) MUL R2,R1  -> R1
  f) ST  R1     -> x

(Precedence graph: a ->(3) c, b ->(3) c, c ->(2) e, d ->(3) e, e ->(2) f.) Dependence graphs may have multiple roots.

INSTRUCTION SCHEDULING
A correct schedule S maps each n in N to an integer representing its cycle number, such that:
- S(n) > 0 for all n in N;
- if (n1,n2) is in E, then S(n1) + delay(n1) <= S(n2);
- for each type t, there are no more operations of type t in any cycle than the target machine can issue.
The length of a schedule S, denoted L(S), is L(S) = max over n in N of (S(n) + delay(n)). The common goal of instruction scheduling is to find the shortest possible correct schedule: S is time-optimal if L(S) <= L(S1) for all other schedules S1. A schedule might also be optimal in terms of registers, power, etc.

INSTRUCTION SCHEDULING
Critical points:
- All operands must be available, and multiple operations can be ready at once.
- Moving operations can change register lifetimes; placing uses near definitions can shorten register lifetimes.
Together, these issues make scheduling hard (NP-complete). Block-level scheduling, restricted to straight-line code, is the simple case; the dominant algorithm is list scheduling. Issues at the block boundaries must also be considered.
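The three conditions in the schedule definition above translate directly into a checker; a sketch assuming a single operation type with a fixed issue width:

```python
from collections import Counter

def is_correct_schedule(S, edges, delay, issue_width=1):
    """Check the correctness conditions for a schedule S (sketch).
    S: op -> cycle; edges: (n1, n2) dependence pairs; delay: op -> latency."""
    if any(cycle <= 0 for cycle in S.values()):
        return False                                   # S(n) > 0
    if any(S[n1] + delay[n1] > S[n2] for n1, n2 in edges):
        return False                                   # dependences respected
    return all(c <= issue_width for c in Counter(S.values()).values())

def schedule_length(S, delay):
    """L(S) = max over n of S(n) + delay(n)."""
    return max(S[n] + delay[n] for n in S)
```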

LIST SCHEDULING
Lots of versions exist; the general framework:
- Rename to remove anti-dependences (optional).
- Build a precedence graph, with edges annotated with latency.
- Compute a priority function over the nodes.
- Use list scheduling to construct a schedule, one cycle at a time:
  - Use a queue of operations that are ready, initialized to the nodes without predecessors.
  - At each cycle, choose a ready operation and schedule it; with multiple candidates, choose one based on the priority function.
  - Update the ready queue: when an operation finishes, check whether new operations can be added.

LOCAL LIST SCHEDULING

  Cycle <- 1
  Ready <- leaves of P                // operations that could be issued
  Active <- {}                        // operations currently executing
  while (Ready U Active != {})
    if (Ready != {}) then
      remove an op from Ready         // removal in priority order
      S(op) <- Cycle                  // schedule the operation
      Active <- Active U {op}
    Cycle <- Cycle + 1
    for each op in Active
      if (S(op) + delay(op) <= Cycle) then
        remove op from Active         // op has completed execution
        for each successor s of op in P
          if (s is ready) then        // all of s's operands are available
            Ready <- Ready U {s}
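A direct Python rendering of this loop for a single-issue machine; a sketch, with the dict encodings of the precedence graph as assumptions:

```python
import heapq

def list_schedule(nodes, succs, delay, priority):
    """List scheduling per the pseudocode above (single-issue sketch).
    nodes: operation ids; succs[n]: successors of n in the precedence
    graph P; delay[n]: latency; priority[n]: rank (e.g. latency-weighted
    path length). Assumes the graph is acyclic."""
    npreds = {n: 0 for n in nodes}
    for n in nodes:
        for s in succs[n]:
            npreds[s] += 1
    ready = [(-priority[n], n) for n in nodes if npreds[n] == 0]  # leaves
    heapq.heapify(ready)
    active = []                       # (finish_cycle, op) heap
    schedule, cycle = {}, 1

    while ready or active:
        if ready:
            _, op = heapq.heappop(ready)       # highest priority first
            schedule[op] = cycle
            heapq.heappush(active, (cycle + delay[op], op))
        cycle += 1
        while active and active[0][0] <= cycle:
            _, op = heapq.heappop(active)      # op has completed execution
            for s in succs[op]:
                npreds[s] -= 1
                if npreds[s] == 0:             # all operands now available
                    heapq.heappush(ready, (-priority[s], s))
    return schedule
```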

SCHEDULING EXAMPLE
1. Build the precedence graph.
2. Determine priorities: longest latency-weighted path (load 3, store 3, loadi 1, add 1, mult 2).

The code:
  a: loadai  r0,@w -> r1
  b: add     r1,r1 -> r1
  c: loadai  r0,@x -> r2
  d: mult    r1,r2 -> r1
  e: loadai  r0,@y -> r2
  f: mult    r1,r2 -> r1
  g: loadai  r0,@z -> r2
  h: mult    r1,r2 -> r1
  i: storeai r1 -> r0,@w

The precedence graph chains a -> b -> d -> f -> h -> i, with c -> d, e -> f, and g -> h. The resulting priorities: a:13, c:12, b:10, e:10, d:9, g:8, f:7, h:5, i:3.
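Computing that priority as the latency-weighted longest path over the precedence graph; a memoized sketch whose dict encoding reproduces the numbers above:

```python
from functools import lru_cache

def latency_weighted_priorities(succs, delay):
    """rank(n) = delay(n) + max over successors s of rank(s); a leaf
    gets rank(n) = delay(n). Assumes an acyclic precedence graph."""
    @lru_cache(maxsize=None)
    def rank(n):
        return delay[n] + max((rank(s) for s in succs[n]), default=0)
    return {n: rank(n) for n in succs}

succs = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}
delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2,
         "g": 3, "h": 2, "i": 3}
print(latency_weighted_priorities(succs, delay))
# {'a': 13, 'b': 10, 'c': 12, 'd': 9, 'e': 10, 'f': 7, 'g': 8, 'h': 5, 'i': 3}
```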

LIST SCHEDULING TRACE
Running the algorithm on the example, one cycle at a time:

  time  Ready        Active    issued
  1     {a,c,e,g}    {}        a: loadai  r0,@w -> r1
  2     {c,e,g}      {a}       c: loadai  r0,@x -> r2
  3     {e,g}        {a,c}     e: loadai  r0,@y -> r3   (register name changed)
  4     {b,g}        {c,e}     b: add     r1,r1 -> r1   (a completed)
  5     {d,g}        {e}       d: mult    r1,r2 -> r1   (b, c completed)
  6     {g}          {d,e}     g: loadai  r0,@z -> r2
  7     {f}          {g}       f: mult    r1,r3 -> r1   (d, e completed)
  8     {}           {f,g}     (nothing ready)
  9     {h}          {}        h: mult    r1,r2 -> r1   (f, g completed)
  10    {}           {h}       (nothing ready)
  11    {i}          {}        i: storeai r1 -> r0,@w   (h completed)

SCHEDULING EXAMPLE (summary)
1. Build the precedence graph. 2. Determine priorities: longest latency-weighted path. 3. Perform list scheduling:

   1) a: loadai  r0,@w -> r1
   2) c: loadai  r0,@x -> r2
   3) e: loadai  r0,@y -> r3    // new register name used
   4) b: add     r1,r1 -> r1
   5) d: mult    r1,r2 -> r1
   6) g: loadai  r0,@z -> r2
   7) f: mult    r1,r3 -> r1
   9) h: mult    r1,r2 -> r1
  11) i: storeai r1 -> r0,@w

PRIORITY FUNCTIONS
Various ways to decide the rank of a node:
- Longest latency-weighted path (prioritizes critical paths)
- The number of immediate successors
- The total number of descendants
- Latency
- Rank a node higher if it contains the last use of a value (tends to decrease the demand for registers)
Unfortunately, none dominates the others in terms of overall schedule quality. Using multiple priorities allows tiebreaks.

FORWARD & BACKWARD SCHEDULING
List scheduling breaks down into two distinct classes:
- Forward list scheduling: start with the available operations and work forward in time; an operation is ready when all its operands are available.
- Backward list scheduling: start with the operations that have no successors and work backward in time; an operation is ready when its latency covers its uses.
Neither always wins in practice (example in EAC Ch. 12). A compiler can try several versions of list scheduling and choose the shortest schedule.

EXAMPLE
Forward and backward scheduling can produce different results. (Figure: a block from the SPEC benchmark go: four loadi operations and an lshift feed four adds and an addi, whose results flow through a cmp and five stores into the final cbr; subscripts identify the distinct operations. The accompanying latency table for load/loadi/add/addi/store/cmp was lost in transcription.)

EXAMPLE (continued)
(Figure: the forward and backward schedules of the block on a machine with two integer units and one memory unit, using latency to the root as the priority. The two schedules order the loadi/add/store operations differently and differ in total length.)

SCHEDULING LARGER REGIONS
Within a basic block, list scheduling works well; moving beyond basic blocks improves the quality of the generated code. List scheduling forms the basis of most algorithms working on larger regions of code. The critical issue is to guarantee that moving operations does not change the externally observable program behavior, given any possible control flow in the program.

SUPERLOCAL SCHEDULING
Scheduling over an EBB: paths through the EBB form straight-line code and are treated as if they were single blocks. Moving operations across block boundaries must be done carefully.
(CFG for the running example: B1 = {a,b,c,d} branches to B2 = {e,f} and B3 = {g}; B2 branches to B4 = {h,i} and B5 = {j,k}; B4 and B5 join at B6 = {l}.)
Two non-trivial paths: {B1,B2,B4} and {B1,B3}. A compiler can schedule {B1,B2,B4} first, then schedule B3 with B1 as a fixed prefix.

Having B1 in both paths {B1,B2,B4} and {B1,B3} causes conflicts. Moving an op out of B1 into B2 (forward motion or downward motion): compensation code must be inserted in B3 (e.g., if c moves from B1 into B2, there is no c on the path through B3, so a copy of c must be added there). This increases code space.

SUPERLOCAL SCHEDULING (continued)
Moving an op into B1 (backward motion or upward motion), e.g. moving f from B2 up into B1: this lengthens the path {B1,B3} and adds computation to it, and may also need compensation code to undo f in B3. Renaming may avoid the undo.

More aggressive superlocal scheduling: clone blocks to create more context. Join points create blocks that must work in multiple contexts: B6 is entered along 2 paths through B4/B5 and 3 paths in total.

SUPERLOCAL SCHEDULING (continued)
Cloning B5 into B5a/B5b and B6 into B6a/B6b/B6c gives each path its own copy; some blocks can then combine (single successor, single predecessor).

Now schedule the EBBs {B1,B2,B4}, {B1,B2,B5a}, and {B1,B3,B5b}, paying heed to compensation code. This works well for forward motion.

TRACE SCHEDULING
Start with execution counts for the edges, obtained by profiling, and pick the hot path. (Figure: the example CFG annotated with edge counts, e.g. 10 along B1 -> B2 versus 3 along B1 -> B3, and 5 along each edge into B6.)
Pick the hot path, e.g. B1,B2,B4,B6, and schedule it, inserting compensation code in B3 and B5 if needed. If we picked the right path, the other blocks do not matter as much. This places a premium on quality profiles.

SCHEDULING LOOPS
Loops play a critical role in most computation-intensive tasks and are the main target of compiler optimizations. Scheduling can move code around, in particular to find instructions that fill the branch delay slots of loops. Still, small loops may have too few operations to move and to keep the underlying functional units busy.
Loop scheduling techniques: loop unrolling and software pipelining.

LOOP UNROLLING
Main idea: the loop body is replicated several times, and the increment of the loop variable is adjusted to match.
Effects:
- Loop overhead is reduced.
- Larger loop body: more ILP / scheduling freedom.
- Register usage within the loop body increases.

  do i = 1 to n by 1          do i = 1 to n by 2
    a(i) = a(i)*s               a(i)   = a(i)*s
  end                           a(i+1) = a(i+1)*s
                              end

Additional cleanup code is needed at the end if n is not even.

LOOP UNROLLING EXAMPLE
Original loop (register subscripts reconstructed from context):

  do i = 1 to n by 1
    a(i) = a(i)*s
  end

        add     r@a <- rarp + @a
  L1: 1 loada   ra <- (r@a)
      2 addi    r@a <- r@a + 4
      3 cmp_lt  rcc <- r@a, rup
      4 mult    ra <- ra * rs
      5 (stall)
      6 cbr     rcc -> L1, L2
      7 storeao ra -> (r@a - 4)
  L2:

Latencies: loada 3, storeao 3, loadao 1, add 1, mult 3, cmp/cbr 1. The schedule already accounts for the load and branch latencies: 7 cycles per iteration, a 3-instruction loop overhead, and 1 stall due to the multiply latency.

Unrolled by 2 (note the instruction changes):

  do i = 1 to n by 2
    a(i)   = a(i)*s
    a(i+1) = a(i+1)*s
  end

        add     r@a <- rarp + @a
  L1: 1 loada   ra <- (r@a)
      2 loadao  rb <- (r@a + 4)
      3 addi    r@a <- r@a + 8
      4 mult    ra <- ra * rs
      5 mult    rb <- rb * rs
      6 cmp_lt  rcc <- r@a, rup
      7 storeao ra -> (r@a - 8)
      8 cbr     rcc -> L1, L2
      9 storeao rb -> (r@a - 4)
  L2:

9/2 = 4.5 cycles per iteration, 1.5 instructions of loop overhead, and no stalls; one more register is needed.

LOOP UNROLLING (considerations)
The compiler needs to:
- Check whether the loop iterations are independent.
- Rename registers to avoid name dependences.
- Eliminate the extra tests and branches, and adjust the loop iteration and termination code.
- Adjust the load/store offsets.
- Schedule the code.
Pros and cons: reduced loop overhead, but increased code size and register pressure; it is difficult to choose the unrolling factor.

SOFTWARE PIPELINING
Idea: combine instructions from different loop iterations, executed together, to hide latency and keep all functional units busy.

SOFTWARE PIPELINING
Symbolically unroll the loop and select instructions from different iterations; no worry about registers or branch control:

         It0    It1    It2
         LD                     }  start-up code (fill the pipeline)
         MULT   LD              }
         SD     MULT   LD       <- steady-state loop body
                SD     MULT     }  finish-up code
                       SD       }

SOFTWARE PIPELINING EXAMPLE

  Before (unrolled):             After (software pipelined):
    loada   ra                     loada   ra
    mult    ra <- ra * rs          mult    rb <- ra * rs
    storea  ra                     loadao  ra
    loadao  rb                 L1: storea  rb              // iter i
    mult    rb <- rb * rs          mult    rb <- ra * rs   // iter i+1
    storeao rb                     loadao  ra              // iter i+2
    loadao  rc                     addi
    mult    rc <- rc * rs          cmp_lt  rcc <- ..., rup
    storeao rc                     cbr     rcc -> L1, L2
    addi                       L2: storea  rb
    cmp_lt  rcc <- ..., rup        mult    rb <- ra * rs
    cbr     rcc -> L1, L2          storeao rb

Address values need to be adjusted, since the loads and stores in the body refer to different iterations.

SOFTWARE PIPELINING VS. UNROLLING
(Figure: the two techniques produce different execution patterns: unrolling repeats the whole body, while software pipelining overlaps stages of successive iterations.)

SOFTWARE PIPELINING (constraints)
Software pipelining consumes less code space than loop unrolling. Constraints:
- Critical resources.
- Data dependences, e.g.

    do i = 1 to n by 1
      s = a(i) + s
    end

  Here the loop iterations are not independent of each other; dependence analysis is needed to identify such cases.

SOFTWARE PIPELINING IMPLEMENTATION
- Unroll and compact: unroll copies of the loop body and search for a repeating pattern; insert start-up and clean-up code.
- Window scheduling: build a dependence graph for two copies of the basic block of the loop, create a window that contains one complete iteration, and slide the window around looking for the best schedule.
More details in the Dragon book.

WINDOW SCHEDULING EXAMPLE
(Figure: window scheduling of a loop whose body consists of lwf, fadd, swf, and sub operations: a prologue fills the pipeline, the steady-state body overlaps swf/lwf with fadd/sub from adjacent iterations, and an epilogue drains the remaining operations.)

Reference: S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, 1997.


More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

CSE P 501 Compilers. Value Numbering & Op;miza;ons Hal Perkins Winter UW CSE P 501 Winter 2016 S-1

CSE P 501 Compilers. Value Numbering & Op;miza;ons Hal Perkins Winter UW CSE P 501 Winter 2016 S-1 CSE P 501 Compilers Value Numbering & Op;miza;ons Hal Perkins Winter 2016 UW CSE P 501 Winter 2016 S-1 Agenda Op;miza;on (Review) Goals Scope: local, superlocal, regional, global (intraprocedural), interprocedural

More information

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control The MIPS Pipeline CSCI206 - Computer Organization & Programming Pipeline Datapath and Control zybook: 11.6 Developed and maintained by the Bucknell University Computer Science Department - 2017 Hazard

More information

Fall 2008 CSE Qualifying Exam. September 13, 2008

Fall 2008 CSE Qualifying Exam. September 13, 2008 Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for

More information

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Anoop Bhagyanath and Klaus Schneider Embedded Systems Chair University of Kaiserslautern

More information

Computer Science. Questions for discussion Part II. Computer Science COMPUTER SCIENCE. Section 4.2.

Computer Science. Questions for discussion Part II. Computer Science COMPUTER SCIENCE. Section 4.2. COMPUTER SCIENCE S E D G E W I C K / W A Y N E PA R T I I : A L G O R I T H M S, T H E O R Y, A N D M A C H I N E S Computer Science Computer Science An Interdisciplinary Approach Section 4.2 ROBERT SEDGEWICK

More information

Computer Architecture ELEC2401 & ELEC3441

Computer Architecture ELEC2401 & ELEC3441 Last Time Pipeline Hazard Computer Architecture ELEC2401 & ELEC3441 Lecture 8 Pipelining (3) Dr. Hayden Kwok-Hay So Department of Electrical and Electronic Engineering Structural Hazard Hazard Control

More information

Project Two RISC Processor Implementation ECE 485

Project Two RISC Processor Implementation ECE 485 Project Two RISC Processor Implementation ECE 485 Chenqi Bao Peter Chinetti November 6, 2013 Instructor: Professor Borkar 1 Statement of Problem This project requires the design and test of a RISC processor

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

Simple Instruction-Pipelining (cont.) Pipelining Jumps

Simple Instruction-Pipelining (cont.) Pipelining Jumps 6.823, L9--1 Simple ruction-pipelining (cont.) + Interrupts Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Src1 ( j / ~j ) Src2 ( / Ind) Pipelining Jumps

More information

Principles of AI Planning

Principles of AI Planning Principles of 5. Planning as search: progression and regression Malte Helmert and Bernhard Nebel Albert-Ludwigs-Universität Freiburg May 4th, 2010 Planning as (classical) search Introduction Classification

More information

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Finite State Machines Introduction Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Such devices form

More information

Saturday, April 23, Dependence Analysis

Saturday, April 23, Dependence Analysis Dependence Analysis Motivating question Can the loops on the right be run in parallel? i.e., can different processors run different iterations in parallel? What needs to be true for a loop to be parallelizable?

More information

CSC D70: Compiler Optimization Static Single Assignment (SSA)

CSC D70: Compiler Optimization Static Single Assignment (SSA) CSC D70: Compiler Optimization Static Single Assignment (SSA) Prof. Gennady Pekhimenko University of Toronto Winter 08 The content of this lecture is adapted from the lectures of Todd Mowry and Phillip

More information

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle Computer Engineering Department CC 311- Computer Architecture Chapter 4 The Processor: Datapath and Control Single Cycle Introduction The 5 classic components of a computer Processor Input Control Memory

More information

Scalar Optimisation Part 2

Scalar Optimisation Part 2 Scalar Optimisation Part 2 Michael O Boyle January 2014 1 Course Structure L1 Introduction and Recap 4-5 lectures on classical optimisation 2 lectures on scalar optimisation Last lecture on redundant expressions

More information

Fall 2011 Prof. Hyesoon Kim

Fall 2011 Prof. Hyesoon Kim Fall 2011 Prof. Hyesoon Kim Add: 2 cycles FE_stage add r1, r2, r3 FE L ID L EX L MEM L WB L add add sub r4, r1, r3 sub sub add add mul r5, r2, r3 mul sub sub add add mul sub sub add add mul sub sub add

More information

Algorithms. NP -Complete Problems. Dong Kyue Kim Hanyang University

Algorithms. NP -Complete Problems. Dong Kyue Kim Hanyang University Algorithms NP -Complete Problems Dong Kyue Kim Hanyang University dqkim@hanyang.ac.kr The Class P Definition 13.2 Polynomially bounded An algorithm is said to be polynomially bounded if its worst-case

More information

Processor Design & ALU Design

Processor Design & ALU Design 3/8/2 Processor Design A. Sahu CSE, IIT Guwahati Please be updated with http://jatinga.iitg.ernet.in/~asahu/c22/ Outline Components of CPU Register, Multiplexor, Decoder, / Adder, substractor, Varity of

More information

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I. Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm

More information

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman. SP esign Lecture 7 Unfolding cont. & Folding r. Fredrik Edman fredrik.edman@eit.lth.se Unfolding Unfolding creates a program with more than one iteration, J=unfolding factor Unfolding is a structured way

More information

Compiler Optimisation

Compiler Optimisation Compiler Optimisation 8 Dependence Analysis Hugh Leather IF 1.18a hleather@inf.ed.ac.uk Institute for Computing Systems Architecture School of Informatics University of Edinburgh 2018 Introduction This

More information

Chapter 3 Deterministic planning

Chapter 3 Deterministic planning Chapter 3 Deterministic planning In this chapter we describe a number of algorithms for solving the historically most important and most basic type of planning problem. Two rather strong simplifying assumptions

More information

CS 52 Computer rchitecture and Engineering Lecture 4 - Pipelining Krste sanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs52!

More information

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018 ECE 172 Digital Systems Chapter 12 Instruction Pipelining Herbert G. Mayer, PSU Status 7/20/2018 1 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for

More information

IE418 Integer Programming

IE418 Integer Programming IE418: Integer Programming Department of Industrial and Systems Engineering Lehigh University 2nd February 2005 Boring Stuff Extra Linux Class: 8AM 11AM, Wednesday February 9. Room??? Accounts and Passwords

More information

Marwan Burelle. Parallel and Concurrent Programming. Introduction and Foundation

Marwan Burelle.  Parallel and Concurrent Programming. Introduction and Foundation and and marwan.burelle@lse.epita.fr http://wiki-prog.kh405.net Outline 1 2 and 3 and Evolutions and Next evolutions in processor tends more on more on growing of cores number GPU and similar extensions

More information

Compilers. Lexical analysis. Yannis Smaragdakis, U. Athens (original slides by Sam

Compilers. Lexical analysis. Yannis Smaragdakis, U. Athens (original slides by Sam Compilers Lecture 3 Lexical analysis Yannis Smaragdakis, U. Athens (original slides by Sam Guyer@Tufts) Big picture Source code Front End IR Back End Machine code Errors Front end responsibilities Check

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

ICS 233 Computer Architecture & Assembly Language

ICS 233 Computer Architecture & Assembly Language ICS 233 Computer Architecture & Assembly Language Assignment 6 Solution 1. Identify all of the RAW data dependencies in the following code. Which dependencies are data hazards that will be resolved by

More information

Principles of AI Planning

Principles of AI Planning Principles of AI Planning 5. Planning as search: progression and regression Albert-Ludwigs-Universität Freiburg Bernhard Nebel and Robert Mattmüller October 30th, 2013 Introduction Classification Planning

More information

GATE 2014 A Brief Analysis (Based on student test experiences in the stream of CS on 1 st March, Second Session)

GATE 2014 A Brief Analysis (Based on student test experiences in the stream of CS on 1 st March, Second Session) GATE 4 A Brief Analysis (Based on student test experiences in the stream of CS on st March, 4 - Second Session) Section wise analysis of the paper Mark Marks Total No of Questions Engineering Mathematics

More information

CSE 105 THEORY OF COMPUTATION

CSE 105 THEORY OF COMPUTATION CSE 105 THEORY OF COMPUTATION Spring 2016 http://cseweb.ucsd.edu/classes/sp16/cse105-ab/ Today's learning goals Sipser Ch 3.3, 4.1 State and use the Church-Turing thesis. Give examples of decidable problems.

More information

Performance, Power & Energy

Performance, Power & Energy Recall: Goal of this class Performance, Power & Energy ELE8106/ELE6102 Performance Reconfiguration Power/ Energy Spring 2010 Hayden Kwok-Hay So H. So, Sp10 Lecture 3 - ELE8106/6102 2 What is good performance?

More information

Task Assignment. Consider this very small instance: t1 t2 t3 t4 t5 p p p p p

Task Assignment. Consider this very small instance: t1 t2 t3 t4 t5 p p p p p Task Assignment Task Assignment The Task Assignment problem starts with n persons and n tasks, and a known cost for each person/task combination. The goal is to assign each person to an unique task so

More information

Automata Theory CS S-12 Turing Machine Modifications

Automata Theory CS S-12 Turing Machine Modifications Automata Theory CS411-2015S-12 Turing Machine Modifications David Galles Department of Computer Science University of San Francisco 12-0: Extending Turing Machines When we added a stack to NFA to get a

More information

EE382V: System-on-a-Chip (SoC) Design

EE382V: System-on-a-Chip (SoC) Design EE82V: SystemonChip (SoC) Design Lecture EE82V: SystemonaChip (SoC) Design Lecture Operation Scheduling Source: G. De Micheli, Integrated Systems Center, EPFL Synthesis and Optimization of Digital Circuits,

More information

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example This Unit: Scheduling (Static + Dnamic) CIS 50 Computer Architecture Unit 8: Static and Dnamic Scheduling Application OS Compiler Firmware CPU I/O Memor Digital Circuits Gates & Transistors! Previousl:!

More information

ENEE350 Lecture Notes-Weeks 14 and 15

ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining & Amdahl s Law ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining is a method of processing in which a problem is divided into a number of sub problems and solved and the solu8ons of the sub problems

More information

EE382V-ICS: System-on-a-Chip (SoC) Design

EE382V-ICS: System-on-a-Chip (SoC) Design EE8VICS: SystemonChip (SoC) EE8VICS: SystemonaChip (SoC) Scheduling Source: G. De Micheli, Integrated Systems Center, EPFL Synthesis and Optimization of Digital Circuits, McGraw Hill, 00. Additional sources:

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 3: Query Processing Query Processing Decomposition Localization Optimization CS 347 Notes 3 2 Decomposition Same as in centralized system

More information

What is SSA? each assignment to a variable is given a unique name all of the uses reached by that assignment are renamed

What is SSA? each assignment to a variable is given a unique name all of the uses reached by that assignment are renamed Another Form of Data-Flow Analysis Propagation of values for a variable reference, where is the value produced? for a variable definition, where is the value consumed? Possible answers reaching definitions,

More information

Introduction to Theory of Computing

Introduction to Theory of Computing CSCI 2670, Fall 2012 Introduction to Theory of Computing Department of Computer Science University of Georgia Athens, GA 30602 Instructor: Liming Cai www.cs.uga.edu/ cai 0 Lecture Note 3 Context-Free Languages

More information

Department of Electrical and Computer Engineering The University of Texas at Austin

Department of Electrical and Computer Engineering The University of Texas at Austin Department of Electrical and Computer Engineering The University of Texas at Austin EE 360N, Fall 2004 Yale Patt, Instructor Aater Suleman, Huzefa Sanjeliwala, Dam Sunwoo, TAs Exam 1, October 6, 2004 Name:

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

/ : Computer Architecture and Design

/ : Computer Architecture and Design 16.482 / 16.561: Computer Architecture and Design Summer 2015 Homework #5 Solution 1. Dynamic scheduling (30 points) Given the loop below: DADDI R3, R0, #4 outer: DADDI R2, R1, #32 inner: L.D F0, 0(R1)

More information

CISC4090: Theory of Computation

CISC4090: Theory of Computation CISC4090: Theory of Computation Chapter 2 Context-Free Languages Courtesy of Prof. Arthur G. Werschulz Fordham University Department of Computer and Information Sciences Spring, 2014 Overview In Chapter

More information

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved. Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved. 1 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should

More information

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits Chris Calabro January 13, 2016 1 RAM model There are many possible, roughly equivalent RAM models. Below we will define one in the fashion

More information

CSE 417. Chapter 4: Greedy Algorithms. Many Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

CSE 417. Chapter 4: Greedy Algorithms. Many Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. CSE 417 Chapter 4: Greedy Algorithms Many Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. 1 Greed is good. Greed is right. Greed works. Greed clarifies, cuts through,

More information

Algorithms Exam TIN093 /DIT602

Algorithms Exam TIN093 /DIT602 Algorithms Exam TIN093 /DIT602 Course: Algorithms Course code: TIN 093, TIN 092 (CTH), DIT 602 (GU) Date, time: 21st October 2017, 14:00 18:00 Building: SBM Responsible teacher: Peter Damaschke, Tel. 5405

More information

Equalities and Uninterpreted Functions. Chapter 3. Decision Procedures. An Algorithmic Point of View. Revision 1.0

Equalities and Uninterpreted Functions. Chapter 3. Decision Procedures. An Algorithmic Point of View. Revision 1.0 Equalities and Uninterpreted Functions Chapter 3 Decision Procedures An Algorithmic Point of View D.Kroening O.Strichman Revision 1.0 Outline Decision Procedures Equalities and Uninterpreted Functions

More information

DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD

DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD D DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD Yashasvini Sharma 1 Abstract The process scheduling, is one of the most important tasks

More information

Finite Automata and Formal Languages

Finite Automata and Formal Languages Finite Automata and Formal Languages TMV26/DIT32 LP4 2 Lecture 6 April 5th 2 Regular expressions (RE) are an algebraic way to denote languages. Given a RE R, it defines the language L(R). Actually, they

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate

More information

Lexical Analysis Part II: Constructing a Scanner from Regular Expressions

Lexical Analysis Part II: Constructing a Scanner from Regular Expressions Lexical Analysis Part II: Constructing a Scanner from Regular Expressions CS434 Spring 2005 Department of Computer Science University of Alabama Joel Jones Copyright 2003, Keith D. Cooper, Ken Kennedy

More information

SISD SIMD. Flynn s Classification 8/8/2016. CS528 Parallel Architecture Classification & Single Core Architecture C P M

SISD SIMD. Flynn s Classification 8/8/2016. CS528 Parallel Architecture Classification & Single Core Architecture C P M 8/8/26 S528 arallel Architecture lassification & Single ore Architecture arallel Architecture lassification A Sahu Dept of SE, IIT Guwahati A Sahu Flynn s lassification SISD Architecture ategories M SISD

More information

Dataflow Analysis Lecture 2. Simple Constant Propagation. A sample program int fib10(void) {

Dataflow Analysis Lecture 2. Simple Constant Propagation. A sample program int fib10(void) { -4 Lecture Dataflow Analysis Basic Blocks Related Optimizations Copyright Seth Copen Goldstein 00 Dataflow Analysis Last time we looked at code transformations Constant propagation Copy propagation Common

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 09/10, Jan., 2018 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 43 Most Essential Assumptions for Real-Time Systems Upper

More information

CMP 334: Seventh Class

CMP 334: Seventh Class CMP 334: Seventh Class Performance HW 5 solution Averages and weighted averages (review) Amdahl's law Ripple-carry adder circuits Binary addition Half-adder circuits Full-adder circuits Subtraction, negative

More information

CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010

CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010 CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010 Computational complexity studies the amount of resources necessary to perform given computations.

More information

Clock-driven scheduling

Clock-driven scheduling Clock-driven scheduling Also known as static or off-line scheduling Michal Sojka Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Control Engineering November 8, 2017

More information

Unit 6: Branch Prediction

Unit 6: Branch Prediction CIS 501: Computer Architecture Unit 6: Branch Prediction Slides developed by Joe Devie/, Milo Mar4n & Amir Roth at Upenn with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi,

More information

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Yuchun Ma* Zhuoyuan Li* Jason Cong Xianlong Hong Glenn Reinman Sheqin Dong* Qiang Zhou *Department of Computer Science &

More information

Mathmatics 239 solutions to Homework for Chapter 2

Mathmatics 239 solutions to Homework for Chapter 2 Mathmatics 239 solutions to Homework for Chapter 2 Old version of 8.5 My compact disc player has space for 5 CDs; there are five trays numbered 1 through 5 into which I load the CDs. I own 100 CDs. a)

More information