CODE GENERATION: REGISTER ALLOCATION


CODE GENERATION

Goal: translate intermediate code into target code.
Interplay between:
- Register allocation
- Instruction selection
- Instruction scheduling

REGISTER ALLOCATION

REGISTER ALLOCATION: MOTIVATION
- Instructions involving register operands are usually more efficient than those involving operands in memory.
- The number of physical registers is limited.
- Moving data between registers and memory/cache is expensive.
The problem of deciding where to store values at each point in the code:
- Register allocation: deciding which values reside in registers.
- Register assignment: deciding which register to use for a particular value.
This distinction is often lost in the literature.

REGISTER ALLOCATION: GOAL
Effectively use the limited number of registers:
- Still need to produce correct code.
- Want to generate efficient code: registers are faster than memory, so minimize loads/stores from/to memory and minimize the space used to hold spilled values.
Complications (RISC vs. CISC):
- Register classes
- Special-purpose registers
- Operators with additional constraints on registers
- A certain number of registers may need to be reserved

EXAMPLE
Source:
  a = b + c
  t1 = a * a
  b = t1 + a
  c = t1 * b
Register-register addressing mode. Cost assumption: load/store = 2, add/mult = 1.

Naive code (cost = 26):          Allocated code (cost = 14):
  LOAD R1,b                        LOAD R1,b
  LOAD R2,c                        LOAD R2,c
  ADD R1,R1,R2                     ADD R1,R1,R2
  STORE R1,a                       STORE R1,a
  LOAD R1,a                        MUL R2,R1,R1
  MUL R1,R1,R1                     ADD R1,R1,R2
  STORE R1,t1                      MUL R2,R1,R2
  LOAD R1,t1                       STORE R1,b
  LOAD R2,a                        STORE R2,c
  ADD R1,R1,R2
  STORE R1,b
  LOAD R1,t1
  LOAD R2,b
  MUL R1,R1,R2
  STORE R1,c

HIGH-LEVEL PROCESS
Problem: at each instruction, decide which values to keep in registers.
- Simple if |values| <= |registers|; harder if |values| > |registers|.
- If there are not enough registers, decide which registers to spill to memory.
- Insert code to move values between registers and memory.
The compiler must automate this process.

COMPLEXITY
Can we do this optimally? (on real code?)

                      Simplified cases                   Real cases
  Local allocation    O(n) (single size, no spilling)    NP-complete (two sizes)
  Local assignment    O(n)                               NP-complete
  Global allocation   NP-complete for 1 register         NP-complete for k registers
  Global assignment   NP-complete

Local register allocation operates on basic blocks. Real compilers face real problems.

LOCAL REGISTER ALLOCATION
Register allocation within a single basic block. Assumptions that simplify the discussion:
- No registers inherited from predecessors.
- No values left in registers at the end of the basic block.
- Only a single register class: general-purpose registers.
Two approaches:
- Top-down allocator: work from externally derived information about what is important.
- Bottom-up allocator: work from synthesized knowledge about problem instances.

TOP-DOWN ALLOCATOR
General idea: keep heavily used values in registers.
Algorithm:
- Rank values by number of occurrences. First pass: tally the use counts for all virtual registers, e.g.

    priority(x) = sum over blocks B of [ use(x,B) + 2*live(x,B) ]

  where use(x,B) counts the number of times x is used in B, and live(x,B) = 1 only if x is live on exit from B and given a value in B. (A sketch of this ranking pass appears after this subsection.)
- Allocate the first k values to registers, assuming k is the number of available registers.
- Rewrite the code to reflect these choices. Second pass: stores/loads inserted.
A common technique of the 1960s and 70s. An allocated register is dedicated to a value for the entire basic block.

BOTTOM-UP ALLOCATOR
General idea: keep values that are used soon in registers.
Algorithm:
- Start with an empty register set.
- Load values on demand.
- When no register is available, free one: the focus is on replacement rather than allocation. Spill the value whose next use is farthest in the future.
Sound familiar? Think page replacement.
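A minimal sketch of the top-down ranking pass in Python. The three-address tuple encoding of blocks and the live_on_exit map are illustrative assumptions, not part of the slides:

```python
from collections import Counter

def topdown_priorities(blocks, live_on_exit):
    """Use-count ranking for the top-down allocator (sketch):
    priority(x) = sum over blocks B of use(x,B) + 2*live(x,B).
    blocks: {name: [(dst, src1, src2), ...]}; live_on_exit: {name: set}."""
    pri = Counter()
    for name, ops in blocks.items():
        defined = {dst for dst, _, _ in ops}
        for _, src1, src2 in ops:
            pri[src1] += 1                 # use(x, B)
            pri[src2] += 1
        for x in live_on_exit[name]:
            if x in defined:               # live on exit and defined in B
                pri[x] += 2                # 2 * live(x, B)
    return pri                             # top k values get registers
```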

BASIC BOTTOM-UP ALGORITHM
Local liveness information:
- A variable is live if it has a future use.
- A non-live value in a register can be discarded, freeing the register it currently occupies.
- All variables are stored to memory at the end of the block.
Data structures:
- Register descriptor: register status (empty, full) and contents (one or more "values").
- Address descriptor: the location (or locations) where the current value of a variable can be found (register, memory).

EXAMPLE
  a = b + c
  t1 = a * a
  b = t1 + a
  c = t1 * b
  t2 = c + b
  a = t2 + t2      live = {a,b,c}
All non-temporary variables are stored back to memory unless we can prove they are not live after the block.

EXAMPLE (computing liveness backward)
Liveness is computed from the bottom of the block upward. At each instruction, remove the LHS from the live set, since it is a definition, and add the RHS variables, since they are uses. For instance, starting from live = {a,b,c} after the block: processing a = t2 + t2 gives live = {t2,b,c} above it, and processing t2 = c + b then gives live = {b,c}. Continuing upward annotates the whole block, as shown next.

EXAMPLE (fully annotated)
  live = {b,c}
  a = b + c        live = {a}
  t1 = a * a       live = {a,t1}
  b = t1 + a       live = {b,t1}
  c = t1 * b       live = {b,c}
  t2 = c + b       live = {t2,b,c}
  a = t2 + t2      live = {a,b,c}

OPERATION x = y OP z
Assume a register-only addressing mode (i.e., RISC): each operation needs all its operands in registers before execution, and a register must be assigned for the target (result).
Iterate over the operations in the block:
- If y or z is already in a register, re-use that register.
- If y or z is not in a register, load it into a free register. If all registers are full, select the one whose value is used farthest in the future and spill it.
- If y or z is not live after this instruction, free its register.
- Allocate a register for x. Again, if all registers are full, select the one whose value is used farthest in the future and spill it.
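A minimal sketch of this bottom-up pass in Python. The instruction encoding, the precomputed next_use table, and the always-store spill policy are illustrative assumptions; the worked example that follows traces the same policy by hand:

```python
def bottom_up_allocate(block, k, next_use, live_out):
    """Bottom-up local allocation for one basic block (sketch).
    block:     list of (x, y, z) triples for x = y OP z
    k:         number of physical registers
    next_use:  (i, v) -> index of the next use of v at or after
               instruction i, or float('inf') if none; assumed
               precomputed (for i in 0..len(block)) by the backward pass
    live_out:  variables live on exit from the block"""
    reg_of, code = {}, []
    free = [f"R{n}" for n in range(k, 0, -1)]      # R1 is handed out first

    def take(i):
        # Grab a free register; if none, spill the value whose
        # next use is farthest in the future.
        if free:
            return free.pop()
        victim = max(reg_of, key=lambda v: next_use[(i, v)])
        r = reg_of.pop(victim)
        code.append(f"STORE {r},{victim}")          # conservative: always store
        return r

    def ensure(v, i):
        # Make sure operand v is in a register, loading on demand.
        if v not in reg_of:
            reg_of[v] = take(i)
            code.append(f"LOAD {reg_of[v]},{v}")
        return reg_of[v]

    for i, (x, y, z) in enumerate(block):
        ry, rz = ensure(y, i), ensure(z, i)
        for v in (y, z):                            # free dead operands
            if (v not in live_out and v in reg_of
                    and next_use[(i + 1, v)] == float('inf')):
                free.append(reg_of.pop(v))
        rx = take(i)                                # register for the result
        code.append(f"OP {rx},{ry},{rz}")
        reg_of[x] = rx
    for v in sorted(live_out):                      # block exit: store back
        if v in reg_of:
            code.append(f"STORE {reg_of[v]},{v}")
    return code
```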

EXAMPLE
  live = {b,c}
  1. a = b + c       live = {a}
  2. t1 = a * a      live = {a,t1}
  3. b = t1 + a      live = {b,t1}
  4. c = t1 * b      live = {b,c}
  5. t2 = c + b      live = {t2,b,c}
  6. a = t2 + t2     live = {a,b,c}

Initially: three registers, all empty: (R1,R2,R3) = (-,-,-). Current values: (a,b,c,t1,t2) = (m,m,m,-,-), where m means "in memory".

Instruction 1: a = b + c. Live = {a}: no need to keep b or c in a register.
  LOAD R1,b
  LOAD R2,c
  ADD R1,R1,R2     // R1 = R1 + R2
Registers: (R1,R2,R3) = (a,-,-). Current values: (a,b,c,t1,t2) = (R1,m,m,-,-).

EXAMPLE (continued)
Instruction 2: t1 = a * a. Live = {a,t1}. [Registers: (R1,R2,R3) = (a,-,-)]
  MUL R2,R1,R1     // R2 = R1 * R1
Registers: (R1,R2,R3) = (a,t1,-). Current values: (a,b,c,t1,t2) = (R1,m,m,R2,-).

Instruction 3: b = t1 + a. Live = {b,t1}. The register allocated to b is R1, since a is not live after this instruction.
  ADD R1,R1,R2     // R1 = R1 + R2
Registers: (R1,R2,R3) = (b,t1,-). Current values: (a,b,c,t1,t2) = (-,R1,m,R2,-).

Instruction 4: c = t1 * b. Live = {b,c}. [Registers: (R1,R2,R3) = (b,t1,-)] The register allocated to c is R2, since t1 is not live after this instruction.
  MUL R2,R1,R2     // R2 = R1 * R2
Registers: (R1,R2,R3) = (b,c,-). Current values: (a,b,c,t1,t2) = (-,R1,R2,-,-).

Instruction 5: t2 = c + b. Live = {b,c,t2}. The register allocated to t2 is R3.
  ADD R3,R1,R2     // R3 = R1 + R2
Registers: (R1,R2,R3) = (b,c,t2). Current values: (a,b,c,t1,t2) = (-,R1,R2,-,R3).

EXAMPLE (continued)
Instruction 6: a = t2 + t2. Live = {a,b,c}. [Registers: (R1,R2,R3) = (b,c,t2)]
  ADD R3,R3,R3
Registers: (R1,R2,R3) = (b,c,a). Current values: (a,b,c,t1,t2) = (R3,R1,R2,-,-).
Since this is the end of the block, move the live variables to memory:
  MOV R3,a
  MOV R1,b
  MOV R2,c
Registers: (R1,R2,R3) = (-,-,-). Current values: (a,b,c,t1,t2) = (m,m,m,-,-).

EXAMPLE SUMMARY

  SOURCE          LIVE        TARGET           REGISTERS (R1,R2,R3)
  a = b + c       {a}         LOAD R1,b
                              LOAD R2,c
                              ADD R1,R1,R2     (a,-,-)
  t1 = a * a      {a,t1}      MUL R2,R1,R1     (a,t1,-)
  b = t1 + a      {b,t1}      ADD R1,R1,R2     (b,t1,-)
  c = t1 * b      {b,c}       MUL R2,R1,R2     (b,c,-)
  t2 = c + b      {b,c,t2}    ADD R3,R1,R2     (b,c,t2)
  a = t2 + t2     {a,b,c}     ADD R3,R3,R3     (b,c,a)
                              MOV R3,a
                              MOV R1,b
                              MOV R2,c         (-,-,-)

EXAMPLE WITH SPILLING
For the same sequence, suppose we have only 2 physical registers. After instruction 4 (see above):
Registers: (R1,R2) = (b,c). Current values: (a,b,c,t1,t2) = (-,R1,R2,-,-).
Instruction 5: t2 = c + b. Live = {b,c,t2}. Distance to the next use: dist(b) = dist(c) = infinity (no further use in this block), so pick either to spill.
  MOV R1,b         // SPILL!
  ADD R1,R1,R2
Registers: (R1,R2) = (t2,c). Current values: (a,b,c,t1,t2) = (-,m,R2,-,R1).

LIVE EXAMPLE (2 registers)

  SOURCE          LIVE        TARGET           REGISTERS (R1,R2)
  a = b + c       {a}         LOAD R1,b
                              LOAD R2,c
                              ADD R1,R1,R2     (a,-)
  t1 = a * a      {a,t1}      MUL R2,R1,R1     (a,t1)
  b = t1 + a      {b,t1}      ADD R1,R2,R1     (b,t1)
  c = t1 * b      {b,c}       MUL R2,R1,R2     (b,c)
  t2 = c + b      {b,c,t2}    MOV R1,b         // spill
                              ADD R1,R1,R2     (t2,c)
  a = t2 + t2     {a,b,c}     ADD R1,R1,R1     (a,c)
                              MOV R1,a
                              MOV R2,c         (-,-)

SPILLING CONSIDERATIONS
Goal: minimize the spilling cost. Heuristics:
1. Spill the value that is used farthest in the future.
2. Spill clean values instead of dirty values. A value that does not change, and hence need not be stored on a spill, is called clean; otherwise it is dirty.
Taking the distinction between clean and dirty values into account makes local allocation NP-hard, and there is no guarantee which heuristic will produce a better allocation for all cases.

EXAMPLE
Initial: 2 registers holding (x1,x2); x1 clean, x2 dirty. Reference string 1: x3 x1 x2 (x3 clean).

  Spill farthest (evict x2):         Spill clean (evict x1):
    Store x2; Load x3    (x1,x3)       Load x3    (x3,x2)
    (x1 hits)                          Load x1    (x1,x2)
    Load x2              (x1,x2)       (x2 hits)
  3 memory operations                2 memory operations

Here spilling the clean value wins.

EXAMPLE
Initial: 2 registers holding (x1,x2); x1 clean, x2 dirty. Reference string 2: x3 x1 x3 x1 x2 (x3 clean).

  Spill farthest (evict x2):         Spill clean (evict x1):
    Store x2; Load x3    (x1,x3)       Load x3    (x3,x2)
    (x1, x3, x1 all hit)               Load x1    (x1,x2)
    Load x2              (x1,x2)       Load x3    (x3,x2)
                                       Load x1    (x1,x2)
                                       (x2 hits)
  3 memory operations                4 memory operations

Here spilling the farthest value wins: neither heuristic dominates.

REGISTER ALLOCATION ACROSS BLOCKS

  ... store r4 -> x      // end of one block
  load x -> r1 ...       // start of the next block

Given global liveness, this is an assignment problem, not an allocation problem! Local register allocation assumes that variables are stored/loaded to/from memory at block boundaries.
- Could replace the load with a move; a good assignment would obviate the move.
What is harder across multiple blocks? We must build a control-flow graph to understand interblock flow and assign registers in a consistent way.

REGISTER ALLOCATION
A more complex scenario: a block with multiple predecessors in the control-flow graph.

  pred 1: ... store r4 -> x
  pred 2: ... store r3 -> x
  successor: load x -> r1 ...

What if one predecessor has x in a register, but the other does not? We must get the right values into the right registers from either predecessor.
Other issues:
- How to determine the execution frequency of an instruction?
- How to define "farthest next reference" across blocks?

REGISTER ALLOCATION VIA GRAPH COLORING
Graph coloring paradigm (local, superlocal (EBB), global):
1. Build an interference graph (requires computing global LIVE information).
2. Construct a k-coloring, where k is the number of available physical registers; spill some variables if a k-coloring cannot be constructed. (Minimal coloring is NP-complete.)
3. Map colors onto physical registers.

LIVE RANGES
A single live range contains a set of definitions and uses. The concept relies on the notion of liveness: variable x is live at point p if it has been defined and there is a path leading from p to a use of x.
- A live range should start with a definition and end with the last use of that definition.
- A specific variable may have many distinct live ranges.
- Register allocation based on live ranges can place distinct live ranges in different registers: a variable can be stored in different registers at distinct points in the program execution.

LIVE RANGES EXAMPLE
  live = {b,c}
  1: a = b + c       live = {a}
  2: t1 = a * a      live = {a,t1}
  3: b = t1 + a      live = {b,t1}
  4: c = t1 * b      live = {b,c}
  5: t2 = c + b      live = {t2,b,c}
  6: a = t2 + t2     live = {a,b,c}

Live ranges:
  a:  [1,3], [6,exit]
  b:  [entry,1], [3,exit]
  c:  [entry,1], [4,exit]
  t1: [2,4]
  t2: [5,6]

In this sequence, each definition introduces a new live range: sounds familiar? Live variables that need to be propagated to the next basic block form LIVEOUT.

DISCOVERING GLOBAL LIVE RANGES
SSA provides a natural starting point:
- Each definition gets assigned a new live range.
- Variables live across multiple blocks: a single live range covers the definition and its uses.
- A use that might be reached by multiple definitions: phi-functions. A single live range is assigned to both the parameters and the target of a phi-function:

    x0 = ...                      rx = ...
    x1 = ...            ==>       rx = ...
    x2 = phi(x0, x1)
    ... = x2                      ... = rx

INTERFERENCE
Definition: two values interfere if at some point in the program both are simultaneously live (overlapping live ranges). If x and y interfere, they cannot occupy the same register.
Construct the interference graph GI = <N,E>:
- Nodes N in GI represent live ranges.
- Edges E in GI represent interferences between live ranges: for x, y in N, <x,y> is in E iff x and y interfere.
A k-coloring of GI can be mapped into an allocation to k registers.

BUILDING INTERFERENCE GRAPHS
Algorithm to construct GI = <N,E>:

  for each live range LRi, create a node ni in N
  for each basic block b
    LIVENOW = LIVEOUT(b)
    for each operation oi in b, in reverse order, of the form oi: LRc = LRa op LRb
      for each LRj in LIVENOW
        add (LRc, LRj) to E          // interference
      remove LRc from LIVENOW
      add LRa and LRb to LIVENOW

INTERFERENCE GRAPH EXAMPLE
  live = {b,c}
  a1 = b1 + c1       live = {a}
  t1 = a1 * a1       live = {a,t1}
  b2 = t1 + a1       live = {b,t1}
  c2 = t1 * b2       live = {b,c}
  t2 = c2 + b2       live = {b,c,t2}
  a2 = t2 + t2       live = {a,b,c}
(Figure: the interference graph over the live ranges a1, b1, c1, b2, c2, t1, t2, a2.)
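A direct transcription of this algorithm into Python, as a sketch; the tuple encoding of operations and the precomputed LIVEOUT sets are assumptions:

```python
def build_interference_graph(blocks, live_out):
    """Build interference edges per the algorithm above (sketch).
    blocks:   {name: [(LRc, LRa, LRb), ...]}  three-address operations
    live_out: {name: set of live ranges in LIVEOUT(block)}  -- assumed
              precomputed by global liveness analysis"""
    nodes, edges = set(), set()
    for name, ops in blocks.items():
        livenow = set(live_out[name])
        for lr_c, lr_a, lr_b in reversed(ops):
            nodes.update((lr_c, lr_a, lr_b))
            for lr_j in livenow:
                if lr_j != lr_c:
                    edges.add(frozenset((lr_c, lr_j)))   # interference
            livenow.discard(lr_c)
            livenow.update((lr_a, lr_b))
    return nodes, edges
```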

GRAPH COLORING PROBLEM
Problem: a graph G is said to be k-colorable iff its nodes can be labeled with the integers 1..k so that no edge in G connects two nodes with the same label. Determining whether a graph is k-colorable is NP-complete for k > 2.
(Figure: a 4-colorable example graph; with k = 3 there is no color for one node, with k = 4 the coloring succeeds.)

CHAITIN'S ALGORITHM
Bottom-up coloring (in EAC) for k registers:
1. Compute liveness information.
2. Create the interference graph G.
3. Simplify: pick any node n with fewer than k neighbors; remove it, along with all edges incident to it, from the graph and push it onto a stack. This lowers the degree of n's neighbors. If (G - n) can be colored with k colors, so can G. If we reduce the entire graph, go to step 5; otherwise repeat step 3.

Chaitin, "Register Allocation and Spilling via Graph Coloring", SIGPLAN Symposium on Compiler Construction, June 1982.

CHAITIN'S ALGORITHM (continued)
4. Spill: after step 3, if the graph is not empty, we have reached a point where only nodes with degree >= k remain. Mark some node for potential spilling, remove it, and go back to step 3.
5. Assign colors: starting with the empty graph, rebuild it by popping elements off the stack and assigning each a color different from its neighbors. Potential spill nodes may or may not be colorable. (A code sketch of the simplify/select loop follows; the worked example comes after it.)
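A minimal sketch of the simplify/select phases in Python. It omits spill-code insertion and graph rebuilding, and it already pushes spill candidates optimistically (per Briggs et al., discussed later); the highest-degree spill choice mirrors the example below:

```python
def chaitin_color(nodes, edges, k):
    """Simplify/select phases of Chaitin-style coloring (sketch).
    edges: iterable of frozenset pairs, as built above.
    Returns (color, spilled): color maps colored nodes to 0..k-1."""
    adj = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    remaining = {n: set(adj[n]) for n in nodes}
    stack = []
    while remaining:
        # Simplify: remove any node with degree < k.
        n = next((x for x in remaining if len(remaining[x]) < k), None)
        if n is None:
            # Spill: mark a potential spill node (here: highest degree).
            n = max(remaining, key=lambda x: len(remaining[x]))
        stack.append(n)
        for m in remaining[n]:
            remaining[m].discard(n)       # lower the neighbors' degrees
        del remaining[n]

    # Select: pop and rebuild, giving each node a color its
    # already-colored neighbors do not use.
    color, spilled = {}, set()
    while stack:
        n = stack.pop()
        used = {color[m] for m in adj[n] if m in color}
        free = [c for c in range(k) if c not in used]
        if free:
            color[n] = free[0]
        else:
            spilled.add(n)                # potential spill was uncolorable
    return color, spilled
```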

EXAMPLE
Interference graph from the previous example; assume k = 3. Simplify: repeatedly pick a node with fewer than 3 neighbors, remove it, and push it onto the stack:

  pick a1   ->  stack: a1
  pick t1   ->  stack: a1 t1
  pick a2   ->  stack: a1 t1 a2
  pick b2   ->  stack: a1 t1 a2 b2
  pick c2   ->  stack: a1 t1 a2 b2 c2
  pick t2   ->  stack: a1 t1 a2 b2 c2 t2    (empty graph!)

Select (colors 1: black, 2: red, 3: blue): pop and rebuild, assigning each node a color different from its already-colored neighbors, in the order t2, c2, b2, a2, t1, a1.

EXAMPLE (coloring mapped onto registers)

  a1 = b1 + c1        R1 = R1 + R2
  t1 = a1 * a1        R2 = R1 * R1
  b2 = t1 + a1        R3 = R2 + R1
  c2 = t1 * b2        R2 = R2 * R3
  t2 = c2 + b2        R1 = R2 + R3
  a2 = t2 + t2        R1 = R1 + R1

EXAMPLE 2
Nodes a, b, d, e, f, t; k = 4: only four physical registers are available.
Can't simplify: no vertex has fewer than 4 neighbors. Choose t as a potential spill: highest degree.

EXAMPLE 2 (continued)
After removing t, f and b have degree < 4. Push f (stack: f); now all remaining nodes have degree < 4.

EXAMPLE 2 (continued)
Continue simplifying until the graph is empty; stack: f e d b a (plus the potential spill t). Pop nodes from the stack one by one and assign colors. Value t is not colorable: it is spilled.

SPILLING
Estimating spill costs: address computation, memory operations, estimated execution frequency.
Spill decisions can be based on different spill metrics:
- Degree in the interference graph
- Spill cost / execution frequency
- Number of spill operations
- A combination of the above; experiment with different heuristics
Spill implementation: first check whether the node is colorable; if not, insert a STORE after the definition and a LOAD before the use.

COLORING SPILLED NODES
Optimistic coloring (Briggs et al.), an improvement over Chaitin's original algorithm: a spilled node may still be colorable.
- Also push spilled nodes onto the stack, according to some priority.
- When popping nodes off for coloring, nodes that turn out to be un-colorable are kept in memory.
(Figure: with 2 registers, a graph in which every node has degree >= 2 can still be 2-colorable, so the degree test alone is pessimistic.)

OTHER SPILLING STRATEGIES
- Clean spilling: spill a value once per block, if possible; avoids redundant loads and stores.
- Best-of-three spilling (Bernstein et al.): Simplify/Select is cheap relative to Build/Coalesce, so try it with several different heuristics and keep the best result.
- Rematerialization (Briggs/Cooper): recognize values that are cheaper to recreate; rather than spill them, rematerialize them.

EXAMPLE 3: 4 COLORS
{k,j} live in:
  g = mem[j+12]
  h = k - 1
  f = g * h
  e = mem[j+8]
  m = mem[j+15]
  b = mem[f]
  c = e + 8
  d = c
  k = m + 4
  j = b
{d,k,j} live out.
(Figure: the interference graph over j, h, f, k, g, d, e, b, c, m.)

EXAMPLE 3: 4 COLORS (after allocation)
{k,j} live in:
  R4 = mem[R1+12]
  R3 = R2 - 1
  R3 = R4 * R3
  R4 = mem[R1+8]
  R2 = mem[R1+15]
  R3 = mem[R3]
  R1 = R4 + 8
  R4 = R1
  R2 = R2 + 4
  R1 = R3
{d,k,j} live out.
Color classes: R1 (black): j, c; R2 (red): k, m; R3 (blue): b, h, f; R4 (green): e, d, g.

REGISTER COALESCING
Eliminate redundant register moves by merging live ranges. Register coalescing combines non-interfering nodes in the graph that are connected by a copy statement, yielding a new node with the union of the edges of the two previous nodes.
Benefits of coalescing ni and nj:
- Copies are eliminated.
- Reduces the degree of any node that interferes with both ni and nj.
- Shrinks the set of live ranges.
Interleave the search for coalescing opportunities with the simplify phase.

EXAMPLE 3: REGISTER COALESCING
{k,j} live in:
  g = j + 12
  h = k - 1
  f = g * h
  e = j + 8
  m = j + 15
  b = f + 2
  c = e + 8
  d = c         // copy
  k = m + 4
  j = b         // copy
{d,k,j} live out.
(Figure: coalescing the copy d = c merges nodes c and d into a single node c/d.)
Effect: the degrees of nodes b and m are reduced, and k does not really interfere with c.

REGISTER COALESCING (continued)
Degree changes: coalescing may make a colorable graph uncolorable, because merging adds additional constraints to the coloring.
- Conservative coalescing: combine Rx and Ry to form Rxy only if Rxy has fewer than k neighbors of degree >= k (Briggs et al.). A sketch of this test follows.
Imprecision: interference not included originally may be introduced; rebuild the interference graph.
The order of coalescing matters: coalescing two live ranges may prevent subsequent coalescing of other live ranges. In principle, coalesce the most frequently executed copies first.
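The Briggs test translates into a few lines; a sketch, assuming an adjacency-set representation of the interference graph:

```python
def briggs_coalesce_ok(adj, x, y, k):
    """Conservative (Briggs) coalescing test (sketch): merging x and y
    is safe if the merged node has fewer than k neighbors of degree >= k.
    adj: live range -> set of interfering live ranges."""
    merged = (adj[x] | adj[y]) - {x, y}
    def degree_after(n):
        # A node adjacent to both x and y loses one edge in the merge.
        return len(adj[n]) - (1 if x in adj[n] and y in adj[n] else 0)
    return sum(1 for n in merged if degree_after(n) >= k) < k
```

Such a merged node is guaranteed to simplify, so coalescing cannot turn a colorable graph uncolorable.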

INSTRUCTION SELECTION

INSTRUCTION SELECTION
The problem of deciding the set of instructions included in the generated code. Straightforward if there are no efficiency concerns, e.g. for a = a + 1:

  MOV a, R0
  ADD #1, R0       vs.     INC a
  MOV R0, a

Issues:
- Multiple choices in a rich instruction set
- Automatic selection
- Different cost considerations (speed, power, space)
- Cost may be influenced by the surrounding context
- The ISA may have additional constraints

EXAMPLE ASSEMBLY CODE
SPIM language: RISC, register-register. Operand types: num, register, label, memory.
Arithmetic operations: add r1, r2, r3 (likewise sub, mul, div); addi r1, r2, c.
Addressing modes (load/store):

  Format                   Address computation
  (reg)                    contents of register
  imm                      immediate
  imm(reg)                 contents of register + imm
  symbol                   address of symbol
  symbol +/- imm           address of symbol +/- imm
  symbol +/- imm(reg)      address of symbol +/- (contents of register + imm)

EXAMPLE ASSEMBLY CODE (continued)
SPIM language:
- Move: move r1,r2
- Shifts/rotates: rol, ror, sll, srl
- Unsigned operations, floating-point operations, ...

VARIABLE ADDRESSING
Variables are typically referenced as offsets from some known location (a base):
- Stack pointer
- ARP: activation record pointer
- Local or global data space
E.g., IDENT <a,ARP,4> or IDENT <c,@G,4>: name, base address, offset.

EXAMPLE
Tree for a x 2:

        *
       / \
  IDENT <a,ARP,4>   NUMBER <2>

  Tree-walk code:        Desired code:
    lw  $t0,4($sp)         lw  $t0,4($sp)
    li  $t1,2              mul $t0,$t0,2
    mul $t0,$t0,$t1

We must combine the information included in these two nodes: this is a non-local problem.

THE BIG PICTURE
It is not easy to consider context information in the simple tree-walk approach; we need pattern-matching techniques. Desired properties:
- Efficient instruction selection (generate code quickly)
- Generated code of good quality, for some metric of "good"
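A minimal tree-walk generator in Python showing why the naive code above appears; the tuple node encoding is an assumption for illustration:

```python
def gen(node, regs, code):
    """Naive tree-walk code generation (sketch).
    node: ("IDENT", name, base, offset) | ("NUMBER", value) | ("MUL", l, r)
    regs: free register names; code: output list of SPIM-like strings."""
    kind = node[0]
    if kind == "IDENT":
        r = regs.pop()
        code.append(f"lw {r},{node[3]}($sp)")   # assumes the base is $sp
        return r
    if kind == "NUMBER":
        r = regs.pop()
        code.append(f"li {r},{node[1]}")        # the constant gets a register
        return r
    left = gen(node[1], regs, code)
    right = gen(node[2], regs, code)
    code.append(f"mul {left},{left},{right}")
    regs.append(right)                          # right operand's reg is free
    return left

code = []
gen(("MUL", ("IDENT", "a", "ARP", 4), ("NUMBER", 2)), ["$t1", "$t0"], code)
print(code)   # ['lw $t0,4($sp)', 'li $t1,2', 'mul $t0,$t0,$t1']
```

Because each node is translated in isolation, the constant 2 lands in its own register; emitting the desired mul $t0,$t0,2 requires looking at both nodes at once.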

PATTERN-MATCHING SCHEMES
General idea: pattern matching.
- A tree-oriented IR suggests pattern matching on trees: tree patterns as input, each pattern mapping to a target-machine instruction sequence.
- A linear IR suggests some sort of string matching: strings as input, each string mapping to a target-machine instruction sequence; use text matching or peephole matching.
In practice both work well, though the matchers are quite different. Today: peephole matching for instruction selection applied to a linear IR (see Dragon 8.9 and EAC Ch. 11.3 for tree-pattern matching).

PEEPHOLE MATCHING
Basic idea, inspired by peephole optimization: the compiler can discover improvements locally.
- Look at a small set of adjacent operations.
- Move a peephole over the code and search for improvements.
Classic examples: store followed by load, simple algebraic identities, branch to branch.

  Original code:          Improved code:
    sw $t0,8($sp)           sw $t0,8($sp)
    lw $t2,8($sp)           move $t2,$t0

PEEPHOLE MATCHING EXAMPLES
Simple algebraic identities:

  Original code:          Improved code:
    addi $t3,$t2,0          mul $t4,$t2,$t1
    mul  $t4,$t3,$t1

Branch to branch:

  Original code:          Improved code:
    b L10                   b L11
    ...                     ...
    L10: b L11              L10: b L11

PEEPHOLE MATCHING
Early systems used a limited set of hand-coded patterns; a small window size (2-3 operations) ensured quick processing. Modern peephole instruction selectors: increasingly complex ISAs led to a systematic approach. Break the problem into three tasks:

  IR -> Expander -> LLIR -> Simplifier -> LLIR -> Matcher -> ASM

EXPANDER
Turns IR code into a low-level IR (LLIR):
- Operation-by-operation, template-driven rewriting.
- The LLIR form includes all direct effects.
- Significant, albeit constant, expansion of code size.

SIMPLIFIER
Looks at the LLIR through a window and rewrites it:
- Performs local optimization within the window.
- Uses forward substitution, algebraic simplification, local constant propagation/folding, and dead-effect elimination.
This is the heart of the peephole system; the benefit of peephole optimization shows up in this step.
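The forward-substitution step at the heart of the simplifier, as a minimal Python sketch; the tuple instruction encoding and the naive deadness test stand in for the simplifier's real dead-effect information:

```python
def simplify_window(ops):
    """One simplifier rule as a sketch: forward-substitute a register
    copy ('copy', dst, src) into the following operation and drop the
    copy when dst is dead afterwards. Instructions are tuples of
    (opcode, result, operand, ...)."""
    out, i = [], 0
    while i < len(ops):
        op = ops[i]
        if op[0] == "copy" and i + 1 < len(ops):
            dst, src = op[1], op[2]
            nxt = ops[i + 1]
            dead_later = not any(dst in o[2:] for o in ops[i + 2:])
            if dst in nxt[2:] and dead_later:
                # Rewrite the next op's operands and drop the copy.
                out.append((nxt[0], nxt[1]) +
                           tuple(src if a == dst else a for a in nxt[2:]))
                i += 2
                continue
        out.append(op)
        i += 1
    return out

# ('copy', 'r12', 'r0'), ('load', 'r13', 'r12')  ->  ('load', 'r13', 'r0')
```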

MATCHER
Compares the simplified LLIR against a library of patterns:
- Picks the low-cost pattern that captures the effects; it must preserve the LLIR's effects, though it may add new ones.
- Generates the assembly code output.

EXAMPLE
Original IR code (computing w = x - 2*y):

  OP    Arg1   Arg2   Result
  mul   2      y      t1
  sub   x      t1     w

Expand to LLIR code:

  r10 <- 2
  r11 <- @y
  r12 <- r0 + r11
  r13 <- MEM(r12)
  r14 <- r10 x r13
  r15 <- @x
  r16 <- r0 + r15
  r17 <- MEM(r16)
  r18 <- r17 - r14
  r19 <- @w
  r20 <- r0 + r19
  MEM(r20) <- r18

Here @x, @y, @w are the offsets of x, y, w from the activation record pointer r0. Think of the ri as temporaries, of which we have an infinite number.

SIMPLIFIER (3-OPERATION WINDOW)
Slide a three-operation window over the LLIR, applying forward substitution and dead-code elimination at each step (assuming we know each substituted register is dead after the substitution):

  r10 <- 2; r11 <- @y; r12 <- r0 + r11        =>  r10 <- 2; r12 <- r0 + @y        (fold r11, now dead)
  r10 <- 2; r12 <- r0 + @y; r13 <- MEM(r12)   =>  r10 <- 2; r13 <- MEM(r0 + @y)   (fold r12)
  r10 <- 2; r13 <- MEM(r0 + @y); r14 <- r10 x r13
                                              =>  r13 <- MEM(r0 + @y); r14 <- 2 x r13   (fold r10)
  r13 <- MEM(r0 + @y) rolls out of the window
  r14 <- 2 x r13; r15 <- @x; r16 <- r0 + r15  =>  r14 <- 2 x r13; r16 <- r0 + @x  (fold r15)
  r14 <- 2 x r13; r16 <- r0 + @x; r17 <- MEM(r16)
                                              =>  r14 <- 2 x r13; r17 <- MEM(r0 + @x)   (fold r16)
  r14 and r17 roll out; r18 <- r17 - r14 admits no simplification
  r18 <- r17 - r14; r19 <- @w; r20 <- r0 + r19 => r18 <- r17 - r14; r20 <- r0 + @w (fold r19)
  r18 <- r17 - r14; r20 <- r0 + @w; MEM(r20) <- r18
                                              =>  r18 <- r17 - r14; MEM(r0 + @w) <- r18 (fold r20)

Simplified LLIR code:

  r13 <- MEM(r0 + @y)
  r14 <- 2 x r13
  r17 <- MEM(r0 + @x)
  r18 <- r17 - r14
  MEM(r0 + @w) <- r18

5 operations instead of 12; 4 temporaries instead of 11.

EXAMPLE
Simplified LLIR code:

  r13 <- MEM(r0 + @y)
  r14 <- 2 x r13
  r17 <- MEM(r0 + @x)
  r18 <- r17 - r14
  MEM(r0 + @w) <- r18

Match to SPIM code:

  lw  $t0,@y($sp)
  mul $t1,$t0,2
  lw  $t2,@x($sp)
  sub $t3,$t2,$t1
  sw  $t3,@w($sp)

The matcher introduced all memory operations and temporary names; the result turned out to be pretty good code.

OTHER CONSIDERATIONS
- Dead values: a list of dead values can be constructed during the expansion step.
- Control-flow operations. Easy approach: clear the window when a control-flow operation or a label is reached. Difficult approach: examine the context around branches.
- Logical windows: the physical window is small, which is quick but misses some opportunities. Alternatively, the compiler can consider definitions and uses of the same value together, as if they were in a logical window.

INSTRUCTION SCHEDULING

INSTRUCTION SCHEDULING
Problem: given a fixed set of instructions, decide their execution order. Primary purpose: minimize execution time. Globally picking the best order is NP-complete.
Possible impacts of changing the computation order:
- Hide latency
- Improve hardware utilization (superscalar and VLIW)
- Reduce register pressure
Restrictions:
- Data dependences / control dependences
- Resource constraints

VLIW PROCESSORS
VLIW (Very Long Instruction Word):
- Can issue more than one pipelined instruction per cycle.
- A wide instruction holds several normal instructions, all of which are issued at the same time; typically each corresponds to an operation on a different functional unit.
- Scheduling is more complicated: compilers are expected to pack the wide instruction correctly and as efficiently as possible.

WHY DOES IT MATTER?
Many operations have non-zero latencies, so execution time is order-dependent. Assumed latencies (conservative):

  Operation   Cycles
  load        3
  store       3
  loadi       1
  add         1
  mult        2
  fadd        1
  fmult       2
  shift       1
  branch      0 to 8

Loads and stores may or may not block; if non-blocking, other operations can fill those issue slots. The scheduler should hide the latencies.

WHY DOES IT MATTER?
Computing w = w * 2 * x * y * z; the first column is the issue time.

  Schedule 1 (done at 21):   Schedule 2 (done at 18):   Schedule 3 (done at 14):
   1 lw   $t0,w               1 lw   $t0,w               1 lw   $t0,w
   4 add  $t0,$t0,$t0         2 lw   $t1,x               2 lw   $t1,x
   5 lw   $t1,x               4 add  $t0,$t0,$t0         3 lw   $t2,y
   8 mult $t0,$t0,$t1         5 mult $t0,$t0,$t1         4 add  $t0,$t0,$t0
   9 lw   $t1,y               6 lw   $t1,y               5 mult $t0,$t0,$t1
  12 mult $t0,$t0,$t1         9 mult $t0,$t0,$t1         6 lw   $t1,z
  13 lw   $t1,z              10 lw   $t1,z               7 mult $t0,$t0,$t2
  16 mult $t0,$t0,$t1        13 mult $t0,$t0,$t1         9 mult $t0,$t0,$t1
  18 sw   $t0,w              15 sw   $t0,w              11 sw   $t0,w

Schedule 3 requires an extra register.

SCHEDULING CONSTRAINTS
- Hardware resources: only a limited number of operations that rely on a particular type of hardware unit can be issued in each cycle.
- Control dependence: the execution of one statement depends on the outcome of another (e.g., a branch).
- Data dependence: constraints that ensure data is produced and consumed in the correct order. There is a data dependence from statement S1 to statement S2 (S2 depends on S1) if and only if both statements access the same memory location, at least one of them stores into it, and there is a feasible run-time execution path from S1 to S2.

PIPELINE QUICK REVIEW
Five stages (IF ID EX MA WB):
- IF: instruction fetch
- ID: instruction decode and register fetch
- EX: execution and effective address calculation
- MA: memory access
- WB: write back
Multiple instructions are overlapped in execution, at different stages: instruction-level parallelism (ILP). Pipeline bubble: the whole pipeline stalls due to structural/data/control hazards.

CONTROL HAZARDS
Branches often take some number of cycles to complete, creating delay slots: the hardware must determine taken or not, and compute the target address if taken. A compiler will try to fill these delay slots with valid and useful instructions (rather than nops); it may need to examine the context around branches and move operations.
(Pipeline diagram: if the branch is taken, the following instruction stalls in IF until the branch resolves.)

BRANCH SCHEDULING EXAMPLES (op dest, src1, src2)
Filling the delay slot from before the branch:

  Add R1, R2, R3              if R2 = 0 then
  if R2 = 0 then        =>      Add R1, R2, R3    (delay slot)
  <delay slot>

Filling the delay slot from the target:

  Sub R4, R5, R6              Add R1, R2, R3
  ...                   =>    if R1 = 0 then
  Add R1, R2, R3                Sub R4, R5, R6    (delay slot)
  if R1 = 0 then
  <delay slot>

STALL CYCLES
Some architectures require some number of cycles between a condition-setting instruction and the branch that uses the condition. Ex: SPARC requires at least one non-floating-point instruction between a floating-point compare and the branch instruction that uses the result.

DATA HAZARD EXAMPLE (op dst, src1, src2)
Data hazard due to memory latency:

  lw  R1,0(R2)    IF ID EX MA WB
  add R3,R1,R4       IF ID stall EX MA WB     (waiting for R1)
  lw  R5,8(R2)          IF stall ID EX MA WB

Hazard removed by scheduling:

  lw  R1,0(R2)    IF ID EX MA WB
  lw  R5,8(R2)       IF ID EX MA WB
  add R3,R1,R4          IF ID EX MA WB

DATA DEPENDENCIES
- Flow dependences (read after write): x = 4; y = x + 1
- Antidependences (write after read): y = x + 1; x = 4
- Output dependences (write after write): x = 4; x = y + 1
The latter two can be eliminated by renaming, since no value flows along them. Flow dependences are also called true dependences.
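The three kinds follow mechanically from each statement's read and write sets; a small Python sketch with an assumed (writes, reads) encoding:

```python
def classify_dependence(s1, s2):
    """Classify the data dependences from s1 to s2 (executed later).
    Each statement is (writes, reads): sets of accessed locations."""
    w1, r1 = s1
    w2, r2 = s2
    kinds = []
    if w1 & r2: kinds.append("flow (read after write)")
    if r1 & w2: kinds.append("anti (write after read)")
    if w1 & w2: kinds.append("output (write after write)")
    return kinds

# x = 4; y = x + 1  ->  ['flow (read after write)']
print(classify_dependence(({"x"}, set()), ({"y"}, {"x"})))
```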

EXAMPLE
  x = 4
  y = 6
  p = x + 2
  z = y + p
  x = z
  y = p
(Figure: flow dependences such as x = 4 -> p = x + 2; output dependences such as x = 4 -> x = z; antidependences such as p = x + 2 -> x = z.)

DEPENDENCE GRAPH
To capture the scheduling constraints of the code, build a dependence/precedence graph G = <N,E>:
- A node n in N is an operation, with type(n) and delay(n).
- A directed edge e = (n1,n2) is in E if and only if n2 uses the result of n1. Edge (n1,n2) can be decorated with delay(n1).

Instruction latency/delay: LD, ST: 3; ADD, SUB: 1; MUL, DIV: 2.

  a) LD  w      -> R1
  b) LD  y      -> R2
  c) DIV R2,R1  -> R1
  d) LD  z      -> R2
  e) MUL R2,R1  -> R1
  f) ST  R1     -> x

(Precedence graph: a ->(3) c, b ->(3) c, c ->(2) e, d ->(3) e, e ->(2) f.) Dependence graphs may have multiple roots.

INSTRUCTION SCHEDULING
A correct schedule S maps each n in N to an integer representing its cycle number, such that:
- S(n) > 0 for all n in N;
- if (n1,n2) is in E, then S(n1) + delay(n1) <= S(n2);
- for each type t, there are no more operations of type t in any cycle than the target machine can issue.
The length of a schedule S, denoted L(S), is L(S) = max over n in N of (S(n) + delay(n)). The common goal of instruction scheduling is to find the shortest possible correct schedule: S is time-optimal if L(S) <= L(S1) for all other schedules S1. A schedule might also be optimal in terms of registers, power, etc.

INSTRUCTION SCHEDULING
Critical points:
- All operands must be available, and multiple operations can be ready at once.
- Moving operations can change register lifetimes; placing uses near definitions can shorten register lifetimes.
Together, these issues make scheduling hard (NP-complete). Block-level scheduling, restricted to straight-line code, is the simple case; the dominant algorithm is list scheduling. Issues at the block boundaries must also be considered.
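The three conditions in the schedule definition above translate directly into a checker; a sketch assuming a single operation type with a fixed issue width:

```python
from collections import Counter

def is_correct_schedule(S, edges, delay, issue_width=1):
    """Check the correctness conditions for a schedule S (sketch).
    S: op -> cycle; edges: (n1, n2) dependence pairs; delay: op -> latency."""
    if any(cycle <= 0 for cycle in S.values()):
        return False                                   # S(n) > 0
    if any(S[n1] + delay[n1] > S[n2] for n1, n2 in edges):
        return False                                   # dependences respected
    return all(c <= issue_width for c in Counter(S.values()).values())

def schedule_length(S, delay):
    """L(S) = max over n of S(n) + delay(n)."""
    return max(S[n] + delay[n] for n in S)
```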

LIST SCHEDULING
Lots of versions exist; the general framework:
- Rename to remove anti-dependences (optional).
- Build a precedence graph, with edges annotated with latency.
- Compute a priority function over the nodes.
- Use list scheduling to construct a schedule, one cycle at a time:
  - Use a queue of operations that are ready, initialized to the nodes without predecessors.
  - At each cycle, choose a ready operation and schedule it; with multiple candidates, choose one based on the priority function.
  - Update the ready queue: when an operation finishes, check whether new operations can be added.

LOCAL LIST SCHEDULING

  Cycle <- 1
  Ready <- leaves of P                // operations that could be issued
  Active <- {}                        // operations currently executing
  while (Ready U Active != {})
    if (Ready != {}) then
      remove an op from Ready         // removal in priority order
      S(op) <- Cycle                  // schedule the operation
      Active <- Active U {op}
    Cycle <- Cycle + 1
    for each op in Active
      if (S(op) + delay(op) <= Cycle) then
        remove op from Active         // op has completed execution
        for each successor s of op in P
          if (s is ready) then        // all of s's operands are available
            Ready <- Ready U {s}
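A direct Python rendering of this loop for a single-issue machine; a sketch, with the dict encodings of the precedence graph as assumptions:

```python
import heapq

def list_schedule(nodes, succs, delay, priority):
    """List scheduling per the pseudocode above (single-issue sketch).
    nodes: operation ids; succs[n]: successors of n in the precedence
    graph P; delay[n]: latency; priority[n]: rank (e.g. latency-weighted
    path length). Assumes the graph is acyclic."""
    npreds = {n: 0 for n in nodes}
    for n in nodes:
        for s in succs[n]:
            npreds[s] += 1
    ready = [(-priority[n], n) for n in nodes if npreds[n] == 0]  # leaves
    heapq.heapify(ready)
    active = []                       # (finish_cycle, op) heap
    schedule, cycle = {}, 1

    while ready or active:
        if ready:
            _, op = heapq.heappop(ready)       # highest priority first
            schedule[op] = cycle
            heapq.heappush(active, (cycle + delay[op], op))
        cycle += 1
        while active and active[0][0] <= cycle:
            _, op = heapq.heappop(active)      # op has completed execution
            for s in succs[op]:
                npreds[s] -= 1
                if npreds[s] == 0:             # all operands now available
                    heapq.heappush(ready, (-priority[s], s))
    return schedule
```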

SCHEDULING EXAMPLE
1. Build the precedence graph.
2. Determine priorities: longest latency-weighted path (load 3, store 3, loadi 1, add 1, mult 2).

The code:
  a: loadai  r0,@w -> r1
  b: add     r1,r1 -> r1
  c: loadai  r0,@x -> r2
  d: mult    r1,r2 -> r1
  e: loadai  r0,@y -> r2
  f: mult    r1,r2 -> r1
  g: loadai  r0,@z -> r2
  h: mult    r1,r2 -> r1
  i: storeai r1 -> r0,@w

The precedence graph chains a -> b -> d -> f -> h -> i, with c -> d, e -> f, and g -> h. The resulting priorities: a:13, c:12, b:10, e:10, d:9, g:8, f:7, h:5, i:3.
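Computing that priority as the latency-weighted longest path over the precedence graph; a memoized sketch whose dict encoding reproduces the numbers above:

```python
from functools import lru_cache

def latency_weighted_priorities(succs, delay):
    """rank(n) = delay(n) + max over successors s of rank(s); a leaf
    gets rank(n) = delay(n). Assumes an acyclic precedence graph."""
    @lru_cache(maxsize=None)
    def rank(n):
        return delay[n] + max((rank(s) for s in succs[n]), default=0)
    return {n: rank(n) for n in succs}

succs = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}
delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2,
         "g": 3, "h": 2, "i": 3}
print(latency_weighted_priorities(succs, delay))
# {'a': 13, 'b': 10, 'c': 12, 'd': 9, 'e': 10, 'f': 7, 'g': 8, 'h': 5, 'i': 3}
```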

LIST SCHEDULING TRACE
Running the algorithm on the example, one cycle at a time:

  time  Ready        Active    issued
  1     {a,c,e,g}    {}        a: loadai  r0,@w -> r1
  2     {c,e,g}      {a}       c: loadai  r0,@x -> r2
  3     {e,g}        {a,c}     e: loadai  r0,@y -> r3   (register name changed)
  4     {b,g}        {c,e}     b: add     r1,r1 -> r1   (a completed)
  5     {d,g}        {e}       d: mult    r1,r2 -> r1   (b, c completed)
  6     {g}          {d,e}     g: loadai  r0,@z -> r2
  7     {f}          {g}       f: mult    r1,r3 -> r1   (d, e completed)
  8     {}           {f,g}     (nothing ready)
  9     {h}          {}        h: mult    r1,r2 -> r1   (f, g completed)
  10    {}           {h}       (nothing ready)
  11    {i}          {}        i: storeai r1 -> r0,@w   (h completed)

SCHEDULING EXAMPLE (summary)
1. Build the precedence graph. 2. Determine priorities: longest latency-weighted path. 3. Perform list scheduling:

   1) a: loadai  r0,@w -> r1
   2) c: loadai  r0,@x -> r2
   3) e: loadai  r0,@y -> r3    // new register name used
   4) b: add     r1,r1 -> r1
   5) d: mult    r1,r2 -> r1
   6) g: loadai  r0,@z -> r2
   7) f: mult    r1,r3 -> r1
   9) h: mult    r1,r2 -> r1
  11) i: storeai r1 -> r0,@w

PRIORITY FUNCTIONS
Various ways to decide the rank of a node:
- Longest latency-weighted path (prioritizes critical paths)
- The number of immediate successors
- The total number of descendants
- Latency
- Rank a node higher if it contains the last use of a value (tends to decrease the demand for registers)
Unfortunately, none dominates the others in terms of overall schedule quality. Using multiple priorities allows tiebreaks.

FORWARD & BACKWARD SCHEDULING
List scheduling breaks down into two distinct classes:
- Forward list scheduling: start with the available operations and work forward in time; an operation is ready when all its operands are available.
- Backward list scheduling: start with the operations that have no successors and work backward in time; an operation is ready when its latency covers its uses.
Neither always wins in practice (example in EAC Ch. 12). A compiler can try several versions of list scheduling and choose the shortest schedule.

EXAMPLE
Forward and backward scheduling can produce different results. (Figure: a block from the SPEC benchmark go: four loadi operations and an lshift feed four adds and an addi, whose results flow through a cmp and five stores into the final cbr; subscripts identify the distinct operations. The accompanying latency table for load/loadi/add/addi/store/cmp was lost in transcription.)

EXAMPLE (continued)
(Figure: the forward and backward schedules of the block on a machine with two integer units and one memory unit, using latency to the root as the priority. The two schedules order the loadi/add/store operations differently and differ in total length.)

SCHEDULING LARGER REGIONS
Within a basic block, list scheduling works well; moving beyond basic blocks improves the quality of the generated code. List scheduling forms the basis of most algorithms working on larger regions of code. The critical issue is to guarantee that moving operations does not change the externally observable program behavior, given any possible control flow in the program.

SUPERLOCAL SCHEDULING
Scheduling over an EBB: paths through the EBB form straight-line code and are treated as if they were single blocks. Moving operations across block boundaries must be done carefully.
(CFG for the running example: B1 = {a,b,c,d} branches to B2 = {e,f} and B3 = {g}; B2 branches to B4 = {h,i} and B5 = {j,k}; B4 and B5 join at B6 = {l}.)
Two non-trivial paths: {B1,B2,B4} and {B1,B3}. A compiler can schedule {B1,B2,B4} first, then schedule B3 with B1 as a fixed prefix.

Having B1 in both paths {B1,B2,B4} and {B1,B3} causes conflicts. Moving an op out of B1 into B2 (forward motion or downward motion): compensation code must be inserted in B3 (e.g., if c moves from B1 into B2, there is no c on the path through B3, so a copy of c must be added there). This increases code space.

SUPERLOCAL SCHEDULING (continued)
Moving an op into B1 (backward motion or upward motion), e.g. moving f from B2 up into B1: this lengthens the path {B1,B3} and adds computation to it, and may also need compensation code to undo f in B3. Renaming may avoid the undo.

More aggressive superlocal scheduling: clone blocks to create more context. Join points create blocks that must work in multiple contexts: B6 is entered along 2 paths through B4/B5 and 3 paths in total.

SUPERLOCAL SCHEDULING (continued)
Cloning B5 into B5a/B5b and B6 into B6a/B6b/B6c gives each path its own copy; some blocks can then combine (single successor, single predecessor).

Now schedule the EBBs {B1,B2,B4}, {B1,B2,B5a}, and {B1,B3,B5b}, paying heed to compensation code. This works well for forward motion.

TRACE SCHEDULING
Start with execution counts for the edges, obtained by profiling, and pick the hot path. (Figure: the example CFG annotated with edge counts, e.g. 10 along B1 -> B2 versus 3 along B1 -> B3, and 5 along each edge into B6.)
Pick the hot path, e.g. B1,B2,B4,B6, and schedule it, inserting compensation code in B3 and B5 if needed. If we picked the right path, the other blocks do not matter as much. This places a premium on quality profiles.

SCHEDULING LOOPS
Loops play a critical role in most computation-intensive tasks and are the main target of compiler optimizations. Scheduling can move code around, in particular to find instructions that fill the branch delay slots of loops. Still, small loops may have too few operations to move and to keep the underlying functional units busy.
Loop scheduling techniques: loop unrolling and software pipelining.

LOOP UNROLLING
Main idea: the loop body is replicated several times, and the increment of the loop variable is adjusted to match.
Effects:
- Loop overhead is reduced.
- Larger loop body: more ILP / scheduling freedom.
- Register usage within the loop body increases.

  do i = 1 to n by 1          do i = 1 to n by 2
    a(i) = a(i)*s               a(i)   = a(i)*s
  end                           a(i+1) = a(i+1)*s
                              end

Additional cleanup code is needed at the end if n is not even.

LOOP UNROLLING EXAMPLE
Original loop (register subscripts reconstructed from context):

  do i = 1 to n by 1
    a(i) = a(i)*s
  end

        add     r@a <- rarp + @a
  L1: 1 loada   ra <- (r@a)
      2 addi    r@a <- r@a + 4
      3 cmp_lt  rcc <- r@a, rup
      4 mult    ra <- ra * rs
      5 (stall)
      6 cbr     rcc -> L1, L2
      7 storeao ra -> (r@a - 4)
  L2:

Latencies: loada 3, storeao 3, loadao 1, add 1, mult 3, cmp/cbr 1. The schedule already accounts for the load and branch latencies: 7 cycles per iteration, a 3-instruction loop overhead, and 1 stall due to the multiply latency.

Unrolled by 2 (note the instruction changes):

  do i = 1 to n by 2
    a(i)   = a(i)*s
    a(i+1) = a(i+1)*s
  end

        add     r@a <- rarp + @a
  L1: 1 loada   ra <- (r@a)
      2 loadao  rb <- (r@a + 4)
      3 addi    r@a <- r@a + 8
      4 mult    ra <- ra * rs
      5 mult    rb <- rb * rs
      6 cmp_lt  rcc <- r@a, rup
      7 storeao ra -> (r@a - 8)
      8 cbr     rcc -> L1, L2
      9 storeao rb -> (r@a - 4)
  L2:

9/2 = 4.5 cycles per iteration, 1.5 instructions of loop overhead, and no stalls; one more register is needed.

LOOP UNROLLING (considerations)
The compiler needs to:
- Check whether the loop iterations are independent.
- Rename registers to avoid name dependences.
- Eliminate the extra tests and branches, and adjust the loop iteration and termination code.
- Adjust the load/store offsets.
- Schedule the code.
Pros and cons: reduced loop overhead, but increased code size and register pressure; it is difficult to choose the unrolling factor.

SOFTWARE PIPELINING
Idea: combine instructions from different loop iterations, executed together, to hide latency and keep all functional units busy.

SOFTWARE PIPELINING
Symbolically unroll the loop and select instructions from different iterations; no worry about registers or branch control:

         It0    It1    It2
         LD                     }  start-up code (fill the pipeline)
         MULT   LD              }
         SD     MULT   LD       <- steady-state loop body
                SD     MULT     }  finish-up code
                       SD       }

SOFTWARE PIPELINING EXAMPLE

  Before (unrolled):             After (software pipelined):
    loada   ra                     loada   ra
    mult    ra <- ra * rs          mult    rb <- ra * rs
    storea  ra                     loadao  ra
    loadao  rb                 L1: storea  rb              // iter i
    mult    rb <- rb * rs          mult    rb <- ra * rs   // iter i+1
    storeao rb                     loadao  ra              // iter i+2
    loadao  rc                     addi
    mult    rc <- rc * rs          cmp_lt  rcc <- ..., rup
    storeao rc                     cbr     rcc -> L1, L2
    addi                       L2: storea  rb
    cmp_lt  rcc <- ..., rup        mult    rb <- ra * rs
    cbr     rcc -> L1, L2          storeao rb

Address values need to be adjusted, since the loads and stores in the body refer to different iterations.

SOFTWARE PIPELINING VS. UNROLLING
(Figure: the two techniques produce different execution patterns: unrolling repeats the whole body, while software pipelining overlaps stages of successive iterations.)

SOFTWARE PIPELINING (constraints)
Software pipelining consumes less code space than loop unrolling. Constraints:
- Critical resources.
- Data dependences, e.g.

    do i = 1 to n by 1
      s = a(i) + s
    end

  Here the loop iterations are not independent of each other; dependence analysis is needed to identify such cases.

SOFTWARE PIPELINING IMPLEMENTATION
- Unroll and compact: unroll copies of the loop body and search for a repeating pattern; insert start-up and clean-up code.
- Window scheduling: build a dependence graph for two copies of the basic block of the loop, create a window that contains one complete iteration, and slide the window around looking for the best schedule.
More details in the Dragon book.

WINDOW SCHEDULING EXAMPLE
(Figure: window scheduling of a loop whose body consists of lwf, fadd, swf, and sub operations: a prologue fills the pipeline, the steady-state body overlaps swf/lwf with fadd/sub from adjacent iterations, and an epilogue drains the remaining operations.)

Reference: S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, 1997.


More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

CSE P 501 Compilers. Value Numbering & Op;miza;ons Hal Perkins Winter UW CSE P 501 Winter 2016 S-1

CSE P 501 Compilers. Value Numbering & Op;miza;ons Hal Perkins Winter UW CSE P 501 Winter 2016 S-1 CSE P 501 Compilers Value Numbering & Op;miza;ons Hal Perkins Winter 2016 UW CSE P 501 Winter 2016 S-1 Agenda Op;miza;on (Review) Goals Scope: local, superlocal, regional, global (intraprocedural), interprocedural

More information

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control The MIPS Pipeline CSCI206 - Computer Organization & Programming Pipeline Datapath and Control zybook: 11.6 Developed and maintained by the Bucknell University Computer Science Department - 2017 Hazard

More information

Fall 2008 CSE Qualifying Exam. September 13, 2008

Fall 2008 CSE Qualifying Exam. September 13, 2008 Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for

More information

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Anoop Bhagyanath and Klaus Schneider Embedded Systems Chair University of Kaiserslautern

More information

Computer Science. Questions for discussion Part II. Computer Science COMPUTER SCIENCE. Section 4.2.

Computer Science. Questions for discussion Part II. Computer Science COMPUTER SCIENCE. Section 4.2. COMPUTER SCIENCE S E D G E W I C K / W A Y N E PA R T I I : A L G O R I T H M S, T H E O R Y, A N D M A C H I N E S Computer Science Computer Science An Interdisciplinary Approach Section 4.2 ROBERT SEDGEWICK

More information

Computer Architecture ELEC2401 & ELEC3441

Computer Architecture ELEC2401 & ELEC3441 Last Time Pipeline Hazard Computer Architecture ELEC2401 & ELEC3441 Lecture 8 Pipelining (3) Dr. Hayden Kwok-Hay So Department of Electrical and Electronic Engineering Structural Hazard Hazard Control

More information

Project Two RISC Processor Implementation ECE 485

Project Two RISC Processor Implementation ECE 485 Project Two RISC Processor Implementation ECE 485 Chenqi Bao Peter Chinetti November 6, 2013 Instructor: Professor Borkar 1 Statement of Problem This project requires the design and test of a RISC processor

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

Simple Instruction-Pipelining (cont.) Pipelining Jumps

Simple Instruction-Pipelining (cont.) Pipelining Jumps 6.823, L9--1 Simple ruction-pipelining (cont.) + Interrupts Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Src1 ( j / ~j ) Src2 ( / Ind) Pipelining Jumps

More information

Principles of AI Planning

Principles of AI Planning Principles of 5. Planning as search: progression and regression Malte Helmert and Bernhard Nebel Albert-Ludwigs-Universität Freiburg May 4th, 2010 Planning as (classical) search Introduction Classification

More information

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Finite State Machines Introduction Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Such devices form

More information

Saturday, April 23, Dependence Analysis

Saturday, April 23, Dependence Analysis Dependence Analysis Motivating question Can the loops on the right be run in parallel? i.e., can different processors run different iterations in parallel? What needs to be true for a loop to be parallelizable?

More information

CSC D70: Compiler Optimization Static Single Assignment (SSA)

CSC D70: Compiler Optimization Static Single Assignment (SSA) CSC D70: Compiler Optimization Static Single Assignment (SSA) Prof. Gennady Pekhimenko University of Toronto Winter 08 The content of this lecture is adapted from the lectures of Todd Mowry and Phillip

More information

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle Computer Engineering Department CC 311- Computer Architecture Chapter 4 The Processor: Datapath and Control Single Cycle Introduction The 5 classic components of a computer Processor Input Control Memory

More information

Scalar Optimisation Part 2

Scalar Optimisation Part 2 Scalar Optimisation Part 2 Michael O Boyle January 2014 1 Course Structure L1 Introduction and Recap 4-5 lectures on classical optimisation 2 lectures on scalar optimisation Last lecture on redundant expressions

More information

Fall 2011 Prof. Hyesoon Kim

Fall 2011 Prof. Hyesoon Kim Fall 2011 Prof. Hyesoon Kim Add: 2 cycles FE_stage add r1, r2, r3 FE L ID L EX L MEM L WB L add add sub r4, r1, r3 sub sub add add mul r5, r2, r3 mul sub sub add add mul sub sub add add mul sub sub add

More information

Algorithms. NP -Complete Problems. Dong Kyue Kim Hanyang University

Algorithms. NP -Complete Problems. Dong Kyue Kim Hanyang University Algorithms NP -Complete Problems Dong Kyue Kim Hanyang University dqkim@hanyang.ac.kr The Class P Definition 13.2 Polynomially bounded An algorithm is said to be polynomially bounded if its worst-case

More information

Processor Design & ALU Design

Processor Design & ALU Design 3/8/2 Processor Design A. Sahu CSE, IIT Guwahati Please be updated with http://jatinga.iitg.ernet.in/~asahu/c22/ Outline Components of CPU Register, Multiplexor, Decoder, / Adder, substractor, Varity of

More information

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I. Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm

More information

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman. SP esign Lecture 7 Unfolding cont. & Folding r. Fredrik Edman fredrik.edman@eit.lth.se Unfolding Unfolding creates a program with more than one iteration, J=unfolding factor Unfolding is a structured way

More information

Compiler Optimisation

Compiler Optimisation Compiler Optimisation 8 Dependence Analysis Hugh Leather IF 1.18a hleather@inf.ed.ac.uk Institute for Computing Systems Architecture School of Informatics University of Edinburgh 2018 Introduction This

More information

Chapter 3 Deterministic planning

Chapter 3 Deterministic planning Chapter 3 Deterministic planning In this chapter we describe a number of algorithms for solving the historically most important and most basic type of planning problem. Two rather strong simplifying assumptions

More information

CS 52 Computer rchitecture and Engineering Lecture 4 - Pipelining Krste sanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs52!

More information

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018 ECE 172 Digital Systems Chapter 12 Instruction Pipelining Herbert G. Mayer, PSU Status 7/20/2018 1 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for

More information

IE418 Integer Programming

IE418 Integer Programming IE418: Integer Programming Department of Industrial and Systems Engineering Lehigh University 2nd February 2005 Boring Stuff Extra Linux Class: 8AM 11AM, Wednesday February 9. Room??? Accounts and Passwords

More information

Marwan Burelle. Parallel and Concurrent Programming. Introduction and Foundation

Marwan Burelle.  Parallel and Concurrent Programming. Introduction and Foundation and and marwan.burelle@lse.epita.fr http://wiki-prog.kh405.net Outline 1 2 and 3 and Evolutions and Next evolutions in processor tends more on more on growing of cores number GPU and similar extensions

More information

Compilers. Lexical analysis. Yannis Smaragdakis, U. Athens (original slides by Sam

Compilers. Lexical analysis. Yannis Smaragdakis, U. Athens (original slides by Sam Compilers Lecture 3 Lexical analysis Yannis Smaragdakis, U. Athens (original slides by Sam Guyer@Tufts) Big picture Source code Front End IR Back End Machine code Errors Front end responsibilities Check

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

ICS 233 Computer Architecture & Assembly Language

ICS 233 Computer Architecture & Assembly Language ICS 233 Computer Architecture & Assembly Language Assignment 6 Solution 1. Identify all of the RAW data dependencies in the following code. Which dependencies are data hazards that will be resolved by

More information

Principles of AI Planning

Principles of AI Planning Principles of AI Planning 5. Planning as search: progression and regression Albert-Ludwigs-Universität Freiburg Bernhard Nebel and Robert Mattmüller October 30th, 2013 Introduction Classification Planning

More information

GATE 2014 A Brief Analysis (Based on student test experiences in the stream of CS on 1 st March, Second Session)

GATE 2014 A Brief Analysis (Based on student test experiences in the stream of CS on 1 st March, Second Session) GATE 4 A Brief Analysis (Based on student test experiences in the stream of CS on st March, 4 - Second Session) Section wise analysis of the paper Mark Marks Total No of Questions Engineering Mathematics

More information

CSE 105 THEORY OF COMPUTATION

CSE 105 THEORY OF COMPUTATION CSE 105 THEORY OF COMPUTATION Spring 2016 http://cseweb.ucsd.edu/classes/sp16/cse105-ab/ Today's learning goals Sipser Ch 3.3, 4.1 State and use the Church-Turing thesis. Give examples of decidable problems.

More information

Performance, Power & Energy

Performance, Power & Energy Recall: Goal of this class Performance, Power & Energy ELE8106/ELE6102 Performance Reconfiguration Power/ Energy Spring 2010 Hayden Kwok-Hay So H. So, Sp10 Lecture 3 - ELE8106/6102 2 What is good performance?

More information

Task Assignment. Consider this very small instance: t1 t2 t3 t4 t5 p p p p p

Task Assignment. Consider this very small instance: t1 t2 t3 t4 t5 p p p p p Task Assignment Task Assignment The Task Assignment problem starts with n persons and n tasks, and a known cost for each person/task combination. The goal is to assign each person to an unique task so

More information

Automata Theory CS S-12 Turing Machine Modifications

Automata Theory CS S-12 Turing Machine Modifications Automata Theory CS411-2015S-12 Turing Machine Modifications David Galles Department of Computer Science University of San Francisco 12-0: Extending Turing Machines When we added a stack to NFA to get a

More information

EE382V: System-on-a-Chip (SoC) Design

EE382V: System-on-a-Chip (SoC) Design EE82V: SystemonChip (SoC) Design Lecture EE82V: SystemonaChip (SoC) Design Lecture Operation Scheduling Source: G. De Micheli, Integrated Systems Center, EPFL Synthesis and Optimization of Digital Circuits,

More information

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example This Unit: Scheduling (Static + Dnamic) CIS 50 Computer Architecture Unit 8: Static and Dnamic Scheduling Application OS Compiler Firmware CPU I/O Memor Digital Circuits Gates & Transistors! Previousl:!

More information

ENEE350 Lecture Notes-Weeks 14 and 15

ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining & Amdahl s Law ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining is a method of processing in which a problem is divided into a number of sub problems and solved and the solu8ons of the sub problems

More information

EE382V-ICS: System-on-a-Chip (SoC) Design

EE382V-ICS: System-on-a-Chip (SoC) Design EE8VICS: SystemonChip (SoC) EE8VICS: SystemonaChip (SoC) Scheduling Source: G. De Micheli, Integrated Systems Center, EPFL Synthesis and Optimization of Digital Circuits, McGraw Hill, 00. Additional sources:

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 3: Query Processing Query Processing Decomposition Localization Optimization CS 347 Notes 3 2 Decomposition Same as in centralized system

More information

What is SSA? each assignment to a variable is given a unique name all of the uses reached by that assignment are renamed

What is SSA? each assignment to a variable is given a unique name all of the uses reached by that assignment are renamed Another Form of Data-Flow Analysis Propagation of values for a variable reference, where is the value produced? for a variable definition, where is the value consumed? Possible answers reaching definitions,

More information

Introduction to Theory of Computing

Introduction to Theory of Computing CSCI 2670, Fall 2012 Introduction to Theory of Computing Department of Computer Science University of Georgia Athens, GA 30602 Instructor: Liming Cai www.cs.uga.edu/ cai 0 Lecture Note 3 Context-Free Languages

More information

Department of Electrical and Computer Engineering The University of Texas at Austin

Department of Electrical and Computer Engineering The University of Texas at Austin Department of Electrical and Computer Engineering The University of Texas at Austin EE 360N, Fall 2004 Yale Patt, Instructor Aater Suleman, Huzefa Sanjeliwala, Dam Sunwoo, TAs Exam 1, October 6, 2004 Name:

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

/ : Computer Architecture and Design

/ : Computer Architecture and Design 16.482 / 16.561: Computer Architecture and Design Summer 2015 Homework #5 Solution 1. Dynamic scheduling (30 points) Given the loop below: DADDI R3, R0, #4 outer: DADDI R2, R1, #32 inner: L.D F0, 0(R1)

More information

CISC4090: Theory of Computation

CISC4090: Theory of Computation CISC4090: Theory of Computation Chapter 2 Context-Free Languages Courtesy of Prof. Arthur G. Werschulz Fordham University Department of Computer and Information Sciences Spring, 2014 Overview In Chapter

More information

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved. Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved. 1 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should

More information

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits Chris Calabro January 13, 2016 1 RAM model There are many possible, roughly equivalent RAM models. Below we will define one in the fashion

More information

CSE 417. Chapter 4: Greedy Algorithms. Many Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

CSE 417. Chapter 4: Greedy Algorithms. Many Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. CSE 417 Chapter 4: Greedy Algorithms Many Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. 1 Greed is good. Greed is right. Greed works. Greed clarifies, cuts through,

More information

Algorithms Exam TIN093 /DIT602

Algorithms Exam TIN093 /DIT602 Algorithms Exam TIN093 /DIT602 Course: Algorithms Course code: TIN 093, TIN 092 (CTH), DIT 602 (GU) Date, time: 21st October 2017, 14:00 18:00 Building: SBM Responsible teacher: Peter Damaschke, Tel. 5405

More information

Equalities and Uninterpreted Functions. Chapter 3. Decision Procedures. An Algorithmic Point of View. Revision 1.0

Equalities and Uninterpreted Functions. Chapter 3. Decision Procedures. An Algorithmic Point of View. Revision 1.0 Equalities and Uninterpreted Functions Chapter 3 Decision Procedures An Algorithmic Point of View D.Kroening O.Strichman Revision 1.0 Outline Decision Procedures Equalities and Uninterpreted Functions

More information

DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD

DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD D DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD Yashasvini Sharma 1 Abstract The process scheduling, is one of the most important tasks

More information

Finite Automata and Formal Languages

Finite Automata and Formal Languages Finite Automata and Formal Languages TMV26/DIT32 LP4 2 Lecture 6 April 5th 2 Regular expressions (RE) are an algebraic way to denote languages. Given a RE R, it defines the language L(R). Actually, they

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate

More information

Lexical Analysis Part II: Constructing a Scanner from Regular Expressions

Lexical Analysis Part II: Constructing a Scanner from Regular Expressions Lexical Analysis Part II: Constructing a Scanner from Regular Expressions CS434 Spring 2005 Department of Computer Science University of Alabama Joel Jones Copyright 2003, Keith D. Cooper, Ken Kennedy

More information

SISD SIMD. Flynn s Classification 8/8/2016. CS528 Parallel Architecture Classification & Single Core Architecture C P M

SISD SIMD. Flynn s Classification 8/8/2016. CS528 Parallel Architecture Classification & Single Core Architecture C P M 8/8/26 S528 arallel Architecture lassification & Single ore Architecture arallel Architecture lassification A Sahu Dept of SE, IIT Guwahati A Sahu Flynn s lassification SISD Architecture ategories M SISD

More information

Dataflow Analysis Lecture 2. Simple Constant Propagation. A sample program int fib10(void) {

Dataflow Analysis Lecture 2. Simple Constant Propagation. A sample program int fib10(void) { -4 Lecture Dataflow Analysis Basic Blocks Related Optimizations Copyright Seth Copen Goldstein 00 Dataflow Analysis Last time we looked at code transformations Constant propagation Copy propagation Common

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 09/10, Jan., 2018 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 43 Most Essential Assumptions for Real-Time Systems Upper

More information

CMP 334: Seventh Class

CMP 334: Seventh Class CMP 334: Seventh Class Performance HW 5 solution Averages and weighted averages (review) Amdahl's law Ripple-carry adder circuits Binary addition Half-adder circuits Full-adder circuits Subtraction, negative

More information

CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010

CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010 CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010 Computational complexity studies the amount of resources necessary to perform given computations.

More information

Clock-driven scheduling

Clock-driven scheduling Clock-driven scheduling Also known as static or off-line scheduling Michal Sojka Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Control Engineering November 8, 2017

More information

Unit 6: Branch Prediction

Unit 6: Branch Prediction CIS 501: Computer Architecture Unit 6: Branch Prediction Slides developed by Joe Devie/, Milo Mar4n & Amir Roth at Upenn with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi,

More information

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Yuchun Ma* Zhuoyuan Li* Jason Cong Xianlong Hong Glenn Reinman Sheqin Dong* Qiang Zhou *Department of Computer Science &

More information

Mathmatics 239 solutions to Homework for Chapter 2

Mathmatics 239 solutions to Homework for Chapter 2 Mathmatics 239 solutions to Homework for Chapter 2 Old version of 8.5 My compact disc player has space for 5 CDs; there are five trays numbered 1 through 5 into which I load the CDs. I own 100 CDs. a)

More information