This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example


Joan Rose
6 years ago

CIS 501 Computer Architecture
Unit 8: Static and Dynamic Scheduling

This Unit: Scheduling (Static + Dynamic)
  [Figure: system layers: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors]
  - Previously:
    - Pipelining: multiple stages, a different instruction in each stage
    - Superscalar: multiple instructions in each stage (N-wide)
  - Now:
    - Compiler (static) scheduling
    - Hardware (dynamic) scheduling

Readings
  - H+P: none (not happy with its explanation of this topic)
  - Papers: Alpha 6 (due today)
  - Discussion: Alpha 6 (due next week)

Review Example
  On a single-issue, 5-stage pipeline, how many cycles does each loop iteration take?
  Assume all cache hits and perfect branch prediction.

    loop: ld [r] -> r
          add r + r -> r
          st r -> [r]
          addi r + -> r
          sub r, -> r6
          jnz r6, loop

  [Pipeline diagram: F D X M W per instruction; the add waits one cycle (d*) for the load's value and the st is held behind it (p*)]
  IPC = 6/7 = 0.86

CIS 501: Scheduling

Review Example
  What if the pipeline is wider (superscalar)?
  [Pipeline diagram: the same loop; the d* and p* stalls remain]
  IPC = 6/6 = 1.0, about a 16% speedup (7 cycles down to 6)
  Would we get any more performance by going wider still?

Un-optimized code
    do { *p = *p + x; p++; } while (p != &a[n]);

    loop: ld [r] -> r
          add r + r -> r
          st r -> [r]
          addi r + -> r
          sub r, -> r6
          jnz r6, loop
  - The compiler has taken the code and just emitted assembly in the order the statements appear
  - Can we do better?

Scheduling Code
  - Compiler can re-order instructions
    - Eliminate RAW stalls
    - Place independent instructions near each other
  - Called static scheduling
  - Must be careful to preserve program behavior in all cases!

Dataflow graph
  [Figure: dataflow graph over the loop's instructions]
    ld [r] -> r
    add r + r -> r
    addi r + -> r
    st r -> [r]
    sub r, -> r6
    jne r6, loop

Optimization
  Move the addi and the sub up so that dependent instructions are separated:
    ld [r] -> r
    addi r + -> r
    add r + r -> r
    sub r, -> r6
    st r -> [r]
    jne r6, loop

A problem with that
  - The addi now updates the pointer before the store uses it
  - Now st r -> [r] gets the wrong value!

Fixed
  The store uses a negative offset to compensate for the earlier increment:
    ld [r] -> r
    addi r + -> r
    add r + r -> r
    sub r, -> r6
    st r -> -[r]
    jne r6, loop

Optimized code
    do { *p = *p + x; p++; } while (p != &a[n]);
  - The loop now compiles to the fixed schedule
  - Pieces of each statement are interleaved
  - (Aside: this is why debugging optimized code is confusing)

How fast now?
    loop: ld [r] -> r
          addi r + -> r
          add r + r -> r
          sub r, -> r6
          st r -> -[r]
          jnz r6, loop
  How fast is this on a single-issue machine?
  [Pipeline diagram: F D X M W for every instruction, no stalls]
  IPC = 6/6 = 1.0, as fast as the wider machine running the unoptimized code

How fast now?
  How fast is the same code on a superscalar machine?
  [Pipeline diagram: instructions issue in pairs; one d* stall and the resulting p* stalls remain]
  IPC = 6/4 = 1.5, a 50% speedup
  What about even wider machines, such as 8-wide?

Performance vs width
  [Chart: millions of cycles for M iterations vs. machine width; series: Unoptimized, Scheduled]

Room to improve?
  - Code is much better
  - Superscalar performance greatly improved; wider machines are now useful
  - Can we do better?
    - With the scheduling scope shown: no
    - With a larger scheduling scope: yes
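The IPC and speedup figures for the six-instruction loop can be re-derived with a few lines of arithmetic. This is only a sketch: the helper names are made up for illustration, and the cycle counts 7, 6 and 4 are taken or inferred from the IPC fractions quoted on the slides.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle for one loop iteration."""
    return instructions / cycles


def speedup(old_cycles: int, new_cycles: int) -> float:
    """Ratio of execution times; above 1.0 means the new schedule is faster."""
    return old_cycles / new_cycles


INSTRUCTIONS = 6  # instructions in the loop body

unscheduled_single_issue = ipc(INSTRUCTIONS, 7)  # original order, single issue
scheduled_single_issue = ipc(INSTRUCTIONS, 6)    # re-ordered, single issue
scheduled_superscalar = ipc(INSTRUCTIONS, 4)     # re-ordered, wider machine

print(round(unscheduled_single_issue, 2))  # 0.86
print(scheduled_single_issue)              # 1.0
print(scheduled_superscalar)               # 1.5
print(speedup(6, 4))                       # 1.5, i.e. a 50% speedup
```

Once unrolling removes instructions, the instruction count changes too, so cycles per iteration (not IPC) is the number to compare.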

Scheduling Scope
  - Window of instructions we can re-order in
  - Larger => better schedules
  - Compiler: theoretically the whole program
    - Not practical for many reasons
  - How?
    - One way: loop un-rolling
    - Others exist: not in the scope of this class

Loop un-rolling
  - Take two (or more) iterations
  - Remove extra loop control
    - Getting rid of extra instructions saves time!
    - Note: must compare cycles, not IPC, now
  - Re-schedule both together
    - Larger scope to schedule from
    - Register names may need changing

Loop unrolling
  [Animation in three steps]
  1. Two copies of the loop body back to back
  2. The first copy's loop control is removed and the remaining increment becomes addi r + 8 -> r
  3. Both copies are re-scheduled together, with the second copy renamed to use r7:
       ld [r] -> r
       addi r + 8 -> r
       ld -[r] -> r7
       sub r, -> r6
       add r + r -> r
       add r7 + r -> r7
       st r -> -8[r]
       st r7 -> -[r]
       jnz r6, loop

Performance vs width
  [Chart: millions of cycles for M iterations vs. machine width; series: Unoptimized, Scheduled, and three unrolling factors up to 8x]

Why not unroll K times?
  - More unrolling => more performance
    - Fewer dynamic instructions
    - Better scheduling
  - Downsides / limiting factors?
    - Number of registers
    - More static instructions => I$ pressure

Limitations of static scheduling
  - Assumes cache hits
    - Common case
    - Miss? A different schedule may be better
  - Compiler must be conservative
    - Needs to guarantee correctness
    - Sometimes tough to tell if re-ordering is legal

Re-ordering barrier: branches
    loop: jz r, not_found
          ld [r] -> r
          sub r, r -> r
          jz r, found
          ld [r] -> r
          jmp loop
  - Legal to move the load up?
  - No: if the pointer is null, the load will cause a fault

Re-ordering barrier: ld/st
    ld [r] -> r
    st r -> [r]
  - Can these be switched?
  - No: the two address registers may hold the same value!
  - Technical term: alias
    - Two names for the same memory location

An example
    void f(int *a, int *b, int *c, int N) {
      for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
      }
    }

    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]

Loop unrolled 2x
    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]
    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]
    // add 8 to r, r, and r
  - What can we re-order here?
  - Can we move the second iteration's load up, above the first iteration's store?
  - No: the store's address might equal the load's address!

Aliasing problems
  - Must be conservative
    - f(ptr+1, ptr, ptr) is not the common case
    - But it is possible!
  - If only we could speculate...
    - Allow re-ordering in the common case
    - Get correctness in the rare case
  - Anything software can do, hardware can do better...
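The aliasing hazard above can be reproduced in a few lines. This is a sketch under stated assumptions: memory is modelled as one Python list, pointers are plain indices into it, and both function names are hypothetical. The second function is what a compiler would produce if it unrolled twice and hoisted the second iteration's loads above the first iteration's store.

```python
def f_program_order(mem, a, b, c, n):
    """a[i] = b[i] + c[i], executed strictly in program order."""
    for i in range(n):
        mem[a + i] = mem[b + i] + mem[c + i]


def f_loads_hoisted(mem, a, b, c, n):
    """Unrolled 2x with both iterations' loads moved above the first store.

    Assumes n is even. Only legal if the stores never alias the loads.
    """
    for i in range(0, n, 2):
        b0, c0 = mem[b + i], mem[c + i]
        b1, c1 = mem[b + i + 1], mem[c + i + 1]  # hoisted above the store below
        mem[a + i] = b0 + c0
        mem[a + i + 1] = b1 + c1


# Common case: a, b and c are disjoint, so both versions agree.
m1, m2 = [1, 2, 10, 20, 0, 0], [1, 2, 10, 20, 0, 0]
f_program_order(m1, 4, 0, 2, 2)
f_loads_hoisted(m2, 4, 0, 2, 2)
disjoint_agree = m1 == m2

# Rare case, like f(ptr+1, ptr, ptr): the store to a[0] writes b[1].
m3, m4 = [1, 1, 1], [1, 1, 1]
f_program_order(m3, 1, 0, 0, 2)
f_loads_hoisted(m4, 1, 0, 0, 2)
print(m3, m4)  # [1, 2, 4] [1, 2, 2]
```

In the overlapping call the hoisted load reads b[1] before the store to a[0] has updated it, which is exactly why a compiler that cannot prove the pointers distinct has to leave the load where it is.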

Out-of-order execution
  - Hardware can speculate
    - Load/store ordering
    - Branches
  - D$ misses?
    - Compiler: no idea
    - Hardware: knows when they happen
  - Out-of-order execution
    - Aka dynamic scheduling

Out-of-order execution
  - Execute out of program order
    - Execute the oldest ready instruction
    - Ready: all input values available
    - Reduce RAW stalls
  - Retain the appearance of in-order
    - Maintain correctness

Out-of-order pipeline
  Fetch -> Decode -> Rename -> Dispatch -> [Buffer of instructions] -> Issue -> Reg-read -> Execute -> Writeback -> Commit
  - In-order front end
  - Out-of-order execution

Register renaming
  - Recall static scheduling:
      xor r ^ r -> r
      sub - r -> r
      addi r + -> r
  - sub/add can be re-ordered
  - Must change the register of the sub:
      xor r ^ r -> r
      sub - r -> r7
      addi r7 + -> r

Register renaming
  - Same principle applies to hardware
    - Might re-order anything
    - Create unique names
  - Logical registers => physical registers
  - Map table: holds the translation
    - Indexed by logical register
    - Holds physical register numbers

Register renaming steps
  - Read input numbers from map table
  - Allocate new physical register
    - None available? => stall
  - Update map table with destination reg

Renaming example
  [Animation over the next several slides: the sequence xor, add, sub, addi is renamed one instruction at a time; each step reads the source mappings from the map table, takes a physical register from the free list, and records it as the destination's new mapping]
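The three renaming steps listed above can be sketched as a small function. The register numbers here are hypothetical (the ones on the slides did not survive), and the sketch raises an exception where real hardware would simply stall.

```python
from collections import deque


def rename(instrs, map_table, free_list):
    """Rename (op, sources, destination) tuples from logical to physical names.

    map_table: dict logical register -> current physical register
    free_list: deque of unused physical registers
    """
    renamed = []
    for op, srcs, dst in instrs:
        psrcs = [map_table[s] for s in srcs]  # 1. read input mappings
        if not free_list:
            raise RuntimeError("no free physical register: hardware would stall")
        pdst = free_list.popleft()            # 2. allocate a new physical register
        map_table[dst] = pdst                 # 3. update the destination mapping
        renamed.append((op, psrcs, pdst))
    return renamed


# Hypothetical starting state: r1..r3 map to p1..p3, p4..p6 are free.
map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = deque(["p4", "p5", "p6"])
program = [
    ("xor", ["r1", "r2"], "r3"),
    ("add", ["r3", "r1"], "r1"),
    ("sub", ["r2", "r1"], "r3"),
]
renamed = rename(program, map_table, free_list)
for insn in renamed:
    print(insn)
```

Every destination receives a fresh name, so the two writes to r3 no longer conflict (no WAW), while the add still reads the xor's result through p4 (the RAW dependence is preserved).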

Renaming example (continued)
  [Animation: the xor, the add, and then the sub are renamed in turn; after each one the map table shows the new mapping for its destination and the free list shrinks by one]

Renaming example (continued)
  [Animation: the addi is renamed last; the map table now holds the final mappings]

Out-of-order pipeline
  Fetch -> Decode -> Rename -> Dispatch -> [Buffer of instructions] -> Issue -> Reg-read -> Execute -> Writeback -> Commit
  - Have unique register names
  - Now put into ooo execution structures

Dispatch
  - Renamed instructions go into ooo structures
  - Re-order buffer (ROB)
    - Holds all instructions until commit
  - Issue Queue
    - Un-executed instructions
    - Central piece of scheduling logic
    - Content Addressable Memory (CAM)

RAM vs CAM
  - Random Access Memory
    - Read/write a specific index
    - Get/set the value there
  - Content Addressable Memory
    - Search for a value
    - Find matching indices
  - One structure can have ports of both types

RAM vs CAM: RAM
  [Figure: RAM: read/write a specific index]

RAM vs CAM: CAM
  [Figure: CAM: search for a value and return the matching indices]
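The difference between the two access styles is easy to show on one array. This is a software analogy only: a hardware CAM compares every entry in parallel, while this sketch scans the list. The function names and cell contents are illustrative.

```python
def ram_read(cells, index):
    """RAM-style access: give an index, get the value stored there."""
    return cells[index]


def cam_search(cells, value):
    """CAM-style access: give a value, get every index that holds it."""
    return [i for i, stored in enumerate(cells) if stored == value]


cells = [7, 9, 7, 3]
print(ram_read(cells, 1))    # 9
print(cam_search(cells, 7))  # [0, 2]
print(cam_search(cells, 5))  # []
```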

Issue Queue
  - Holds un-executed instructions
  - Tracks ready inputs
    - Physical register names + ready bit
    - AND the ready bits to tell if the instruction is ready
  Entry fields: Insn | Inp1 | R | Inp2 | R | Dst | Age

Dispatch Steps
  - Allocate IQ slot
    - Full? Stall
  - Read ready bits of inputs
    - Table: one bit per preg
  - Clear ready bit of output in table
    - Instruction has not produced its value yet
  - Write data in IQ slot

Dispatch Example
  [Animation: the renamed xor, add, sub, and addi are dispatched into the issue queue one at a time, starting with the xor at age 0; each dispatch copies the current ready bits of its inputs and clears the ready bit of its destination]

Dispatch Example (continued)
  [Animation: add, sub, and addi enter the issue queue; the add and the addi each have an input that is not ready (n), because it is produced by an instruction still in the queue]

Out-of-order pipeline
  Issue -> Reg-read -> Execute -> Writeback
  - Execution (ooo) stages
  - Select ready instructions
    - Send for execution
  - Wakeup dependents

Issue = Select + Wakeup
  - Select the N oldest, ready instructions
    - N = 1? xor
    - N >= 2? xor and sub
    - Note: may have resource constraints, e.g. ld/st/fp
  [Issue queue: xor and sub are marked Ready!]

Issue = Select + Wakeup
  - Wakeup dependent instructions
    - CAM search for Dst in inputs
    - Set ready
    - Also update the ready-bit table for future instructions

Issue
  - Select/Wakeup takes one cycle
  - Dependents go back to back
  - Now add/addi are ready

Register Read
  - Not done at decode
  - Must read the physical (renamed) register
  - Must be done when the value is ready
    - Or skipped when the value is expected from the bypass
  - Physical register file may be large
    - Multi-cycle read
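Dispatch, select and wakeup fit together as in the following sketch. It is a simplified model under stated assumptions: physical register numbers are invented, there are no functional-unit constraints, and the CAM search is written as an ordinary loop.

```python
from dataclasses import dataclass


@dataclass
class IQEntry:
    insn: str
    srcs: list   # physical source registers
    ready: list  # one ready bit per source
    dst: str
    age: int     # smaller means older


def dispatch(iq, ready_bits, insn, srcs, dst, age):
    """Copy the inputs' ready bits, clear the output's bit, fill an IQ slot."""
    entry = IQEntry(insn, list(srcs), [ready_bits[s] for s in srcs], dst, age)
    ready_bits[dst] = False  # value not produced yet
    iq.append(entry)


def select(iq, n):
    """Remove and return the n oldest entries whose inputs are all ready."""
    chosen = sorted((e for e in iq if all(e.ready)), key=lambda e: e.age)[:n]
    for e in chosen:
        iq.remove(e)
    return chosen


def wakeup(iq, ready_bits, issued):
    """Match each issued destination against waiting inputs (the CAM search)."""
    for done in issued:
        ready_bits[done.dst] = True  # for instructions dispatched later
        for e in iq:
            for i, src in enumerate(e.srcs):
                if src == done.dst:
                    e.ready[i] = True


ready_bits = {f"p{i}": True for i in range(1, 6)}  # p1..p5 hold valid values
iq = []
dispatch(iq, ready_bits, "xor", ["p1", "p2"], "p6", age=0)
dispatch(iq, ready_bits, "add", ["p6", "p4"], "p7", age=1)
dispatch(iq, ready_bits, "sub", ["p5", "p2"], "p8", age=2)
dispatch(iq, ready_bits, "addi", ["p8"], "p9", age=3)

first = select(iq, 2)          # xor and sub
wakeup(iq, ready_bits, first)  # marks p6 and p8 ready
second = select(iq, 2)         # add and addi, one cycle later
print([e.insn for e in first], [e.insn for e in second])
```

With n = 1 the first selection would pick only the xor, the oldest ready entry.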

Renaming review
  Everyone rename this instruction: mul r * -> r
  [Map table and free list shown for the exercise]

Dispatch Review
  Everyone dispatch this instruction: div / -> p
  [Issue queue and ready-bit table shown for the exercise]

Select Review
  [Issue queue holding add, mul, div, and xor; the mul and the div each have an input that is not ready]
  - Determine which instructions are ready
  - Which will be issued at each machine width?

Wakeup Review
  [Same issue queue]
  - What information will change if we issue the add?

OOO execution (2-wide)
  [Animation: xor and sub are ready (RDY) and issue together; they read their physical registers and execute while the add and the addi wake up; the add and the addi then issue, take their inputs from the bypass, and write back]
  - Note the similarity to in-order

Multi-cycle operations
  - Multi-cycle ops (ld, fp, mul, etc.)
    - Wakeup deferred a few cycles
    - Structural hazard?
  - Cache misses?
    - Speculative wake-up (assume hit)
    - Cancel execution of dependents
    - Re-issue later
    - Details: complex, not important

Re-order Buffer (ROB)
  - All instructions in order
  - Purposes
    - Misprediction recovery
    - In-order commit
      - Maintain appearance of in-order execution
      - Freeing of physical registers

Renaming revisited
  - Overwritten register
    - Freed at commit
    - Restored in the map table on recovery
    - Also must be read at rename

Renaming example
  [Animation: the same sequence is renamed again; each renamed instruction now also records, in brackets, the physical register that its destination mapped to before the rename]

Renaming example (continued)
  [Animation: the xor, the add, and the sub are renamed, each recording its overwritten register in brackets]

Renaming example (continued)
  [Animation: the addi is renamed; all four instructions now carry an overwritten-register field]

ROB
  - ROB entry holds all info for recovery/commit
    - Logical register names
    - Physical register names
    - Instruction types
  - Dispatch: insert at tail
    - Full? Stall
  - Commit: remove from head
    - Not completed? Stall

Recovery
  - Completely remove wrong-path instructions
    - Flush from IQ
    - Remove from ROB
    - Restore map table to its state before the misprediction
    - Free destination registers

Recovery example
    bnz r loop
    xor r ^ r -> r
    sub - r -> r
    addi r + -> r
  [Animation: the bnz was mispredicted; the younger instructions are removed youngest first (addi, then sub, add, xor), each one restoring its overwritten mapping in the map table and returning its destination register to the free list]
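The recovery walk described above can be sketched as popping ROB entries from the tail. The entry layout (a dict with `logical_dst`, `old_preg`, `new_preg`) and all register numbers are assumptions made for this illustration; flushing the issue queue is omitted.

```python
def recover(rob, map_table, free_list, branch_index):
    """Squash every ROB entry younger than rob[branch_index], youngest first."""
    while len(rob) - 1 > branch_index:
        entry = rob.pop()                                    # youngest entry
        map_table[entry["logical_dst"]] = entry["old_preg"]  # restore mapping
        free_list.append(entry["new_preg"])                  # free destination


# State after renaming three wrong-path instructions behind a branch.
rob = [
    {"insn": "bnz", "logical_dst": None, "old_preg": None, "new_preg": None},
    {"insn": "xor", "logical_dst": "r1", "old_preg": "p1", "new_preg": "p4"},
    {"insn": "add", "logical_dst": "r2", "old_preg": "p2", "new_preg": "p5"},
    {"insn": "sub", "logical_dst": "r1", "old_preg": "p4", "new_preg": "p6"},
]
map_table = {"r1": "p6", "r2": "p5", "r3": "p3"}
free_list = ["p7"]

recover(rob, map_table, free_list, branch_index=0)
print(map_table)  # {'r1': 'p1', 'r2': 'p2', 'r3': 'p3'}
print(free_list)  # ['p7', 'p6', 'p5', 'p4']
```

Undoing the entries youngest first matters when a register is written twice: r1 passes back through p4 before returning to p1, its mapping at the branch.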

Recovery example (continued)
  [Animation: the add and then the xor are removed; only the bnz remains in the ROB and the map table matches its state at the branch]

What about stores
  - Stores: write D$, not registers
    - Can we rename memory?
    - Recover in the cache?
  - No (at least not easily)
    - Cache writes are unrecoverable
  - Stores: write only when certain
    - That is, at commit

Commit
  - Commit: instruction becomes architected state
    - In-order, only when instructions are finished
    - Free the overwritten register (why?)

Freeing over-written register
  - One physical register held the xor's logical destination before the xor
  - p6 holds it after the xor
  - Anything older than the xor should read the old register
  - Anything younger than the xor should read p6 (until the next instruction that writes the same logical register)
  - At commit of the xor, no older instructions exist, so the old register can be freed

Commit Example
  [Animation: the four renamed instructions wait in the ROB, each with its overwritten register in brackets]
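In-order commit with freeing of the overwritten register can be sketched as follows. The entry layout (`insn`, `done`, `old_preg`) and the register numbers are illustrative assumptions, not taken from the slides.

```python
from collections import deque


def commit(rob, free_list):
    """Retire finished instructions from the ROB head, in order.

    Each retired instruction frees the physical register that its
    destination overwrote: no older instruction can still need it.
    """
    committed = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        if entry["old_preg"] is not None:
            free_list.append(entry["old_preg"])
        committed.append(entry["insn"])
    return committed


rob = deque([
    {"insn": "xor", "done": True, "old_preg": "p1"},
    {"insn": "add", "done": False, "old_preg": "p2"},
    {"insn": "sub", "done": True, "old_preg": "p4"},
])
free_list = []

first = commit(rob, free_list)   # only the xor: the add blocks the finished sub
rob[0]["done"] = True            # the add finishes executing
second = commit(rob, free_list)  # now the add and the sub retire together
print(first, second, free_list)
```

The sub finished early but still waits behind the add, which is what keeps the architected state looking in-order.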

Commit Example (continued)
  [Animation: the xor commits first, then the add, then the sub, leaving only the addi; each commit removes the instruction from the head of the ROB and returns its overwritten register to the free list]

Out of order pipeline diagrams
  - Standard style: large and cumbersome
  - Change layout slightly
    - Columns = stages (dispatch, issue, etc.)
    - Rows = instructions
    - Content of boxes = cycles
  - For our purposes: issue/exec = one cycle
    - Ignore preg read latency, etc.
    - Load-use, mul, div, and FP take longer

Out of order pipeline diagrams
  Columns: Instruction | Disp | Issue | WB | Commit
  Rows:
    ld [p] -> p
    add p + p -> p
    xor p ^ ->
    ld [] ->
  Assumptions: 2-wide; infinite ROB, IQ, and pregs; multi-cycle loads
  Cycle 1:
    - Dispatch ld and add
  Cycle 2:
    - Dispatch xor and ld
    - 1st ld issues; also note its WB cycle (5) while you do this
    - (Note: don't issue if the WB ports are full)

Out of order pipeline diagrams (continued)
  Cycle 3:
    - add and xor are not ready
    - 2nd load is: issue it (WB in cycle 6)
  Cycle 4:
    - Nothing
  Cycle 5:
    - add can issue
  Cycle 6:
    - 1st load can commit (oldest instruction and finished)
    - xor can issue
  Cycle 7:
    - add can commit

Out of order pipeline diagrams (continued)
  Cycle 8:
    - Commit xor and ld (2-wide: can do both at once)

Loads and stores
  Columns: Instruction | Disp | Issue | WB | Commit
  Rows:
    fdiv p / p -> p
    st p -> [ ]
    st p -> [ ]
    ld [ ] ->
  - Can the ld execute while the older stores are still waiting?
  - Why or why not?
  - Aliasing (again): the store addresses may not be known yet
  - Suppose the store addresses are known and differ from the load's address: can the ld execute now?

Forwarding
  - Stores write the cache at commit
    - Commit is in-order, delayed by all older instructions
  - Loads read the cache
    - But early execution is critical
  - Forwarding
    - Allow store -> load communication before the store commits
    - Conceptually like bypassing, but a very different implementation

Forwarding: Store Queue
  - Store Queue
    - Holds all in-flight stores
    - CAM: searchable by address
    - Age logic: determine the youngest matching store older than the load
  - Store execution
    - Write Store Queue
    - Address + data
  - Load execution
    - Search SQ
    - Match? Forward
    - Read D$
  [Figure: Store Queue (SQ) with address, value, and age fields and head/tail pointers, next to the D$/TLB; the load's address and position go in, data comes out]

Load scheduling
  - Store -> load forwarding:
    - Get the value from an executed (but not committed) store to the load
  - Load scheduling:
    - Determine when a load can execute with regard to older stores
  - Conservative load scheduling:
    - All older stores have executed
    - Some architectures: split store address / store data
      - Only require a known address
    - Advantage: always safe
    - Disadvantage: performance (limits out-of-orderness)

Our example from before
    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]
    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]
    // loop control here
  - With conservative load scheduling, what can go out of order?
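The store queue's age logic, "youngest matching store older than the load", can be sketched as below. This is an illustration under assumptions: ages are plain integers where smaller means older, the queue holds only stores that have already executed, and the D$ is a dict. In hardware the SQ search and the cache read happen in parallel.

```python
def load_value(store_queue, dcache, load_age, addr):
    """Return the value a load should see.

    store_queue: executed, uncommitted stores as dicts with age, addr, data
    dcache: dict address -> committed value
    """
    older_matches = [s for s in store_queue
                     if s["addr"] == addr and s["age"] < load_age]
    if older_matches:
        # Forward from the youngest store that is still older than the load.
        return max(older_matches, key=lambda s: s["age"])["data"]
    return dcache[addr]  # no in-flight store to this address


dcache = {0x10: 1, 0x20: 2}
store_queue = [
    {"age": 1, "addr": 0x10, "data": 11},
    {"age": 3, "addr": 0x10, "data": 33},
    {"age": 7, "addr": 0x10, "data": 77},  # younger than the load below
]

print(load_value(store_queue, dcache, load_age=5, addr=0x10))  # 33
print(load_value(store_queue, dcache, load_age=5, addr=0x20))  # 2
```

The store with age 7 is ignored because it comes after the load in program order, and the store with age 1 is shadowed by the later store with age 3.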

Our example from before
  Columns: Instruction | Disp | Issue | WB | Commit
  Rows: the two unrolled iterations (ld, ld, add, st, ld, ld, add, st), renamed to physical registers
  - Suppose a superscalar machine with conservative load scheduling
  - May issue one load per cycle
  - Loads take several cycles to complete
  [Animation over several slides: the diagram is filled in cycle by cycle; the first iteration's loads, add, and store issue in dependence order]
  - Conservative load scheduling: can't issue the second iteration's first ld until the first iteration's st has executed
  [Animation continues: the second iteration proceeds only after that store]
  - Our wide ooo processor may as well be in-order!

Our example from before
  [Completed diagram for conservative load scheduling]
  - It would be nice if we could issue the second iteration's ld [p] -> earlier
  - Can we speculate and issue it then?

Load Speculation
  - Speculation requires two things...
    - Detection of mis-speculations
      - How can we do this?
    - Recovery from mis-speculations
      - Squash from the offending load
      - Saw how to squash from branches: same method

Load Queue
  - Detects load ordering violations
  - Load execution: write address into LQ
    - Also note any store forwarded from
  - Store execution: search LQ
    - Younger load with same addr?
    - Didn't forward from a younger store?
  [Figure: load queue (LQ) with address and age fields and head/tail pointers, next to the SQ and the D$/TLB; the store's position goes in, a flush signal comes out]

Store Queue + Load Queue
  - Store Queue: handles forwarding
    - Written by stores
    - Searched by loads
  - Load Queue: detects ordering violations
    - Written by loads
    - Searched by stores
  - Both together
    - Allow aggressive load scheduling
    - Stores don't constrain load execution
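The load queue check that a store performs when it executes can be sketched as follows. The entry layout is an assumption for this illustration: each executed load records its age, its address, and the age of the store it forwarded from (or None if it read the cache). Smaller ages are older.

```python
def store_executes(load_queue, store_age, addr):
    """Return the age of the oldest load that must be squashed, or None.

    A load violated ordering if it is younger than the store, read the same
    address, and did not get its data from a store younger than this one.
    """
    violators = [
        ld["age"] for ld in load_queue
        if ld["addr"] == addr
        and ld["age"] > store_age
        and (ld["forwarded_from"] is None or ld["forwarded_from"] < store_age)
    ]
    return min(violators) if violators else None


load_queue = [
    {"age": 5, "addr": 0x40, "forwarded_from": None},  # read the cache
    {"age": 8, "addr": 0x40, "forwarded_from": 6},     # forwarded from store 6
    {"age": 9, "addr": 0x80, "forwarded_from": None},
]

print(store_executes(load_queue, store_age=3, addr=0x40))   # 5
print(store_executes(load_queue, store_age=7, addr=0x40))   # 8
print(store_executes(load_queue, store_age=3, addr=0x100))  # None
```

When store 3 executes, load 5 has already read stale data and must be squashed, while load 8 is safe because store 6, which it forwarded from, is younger than store 3.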

Our example from before
  - Aggressive load scheduling?
    - Issue the second iteration's ld [p] -> early, before the older st has executed
  [Animation: the diagram is filled in again with the second iteration's loads issuing early]
  - Saves cycles over conservative
  - Actually uses ooo-ness

Aggressive Load scheduling
  - Allows loads to issue before older stores
    - Increases out-of-orderness
    (+) When no conflict, increases performance
    (-) Conflict => squash => worse performance than waiting
  - Some loads might forward from stores
    - Always aggressive will squash a lot
  - Can we have our cake AND eat it too?

Predictive Load scheduling
  - Predict which loads must wait for stores
  - Fool me once, shame on you; fool me twice?
    - Loads default to aggressive
    - Keep a table of load PCs that have caused squashes
      - Schedule these conservatively
    (+) Simple predictor
    (-) Making "bad" loads wait for all older stores is not so great
  - More complex predictors used in practice
    - Predict which stores loads should wait for

Out of Order: Window Size
  - Scheduling scope = ooo window size
    - Larger = better
  - Constrained by physical registers
    - ROB roughly limited by #preg = ROB size + #logical registers
    - Big register file = hard/slow
  - Constrained by issue queue
    - Limits number of un-executed instructions
    - CAM = can't make big (power + area)
  - Constrained by load + store queues
    - Limit number of loads/stores
    - CAMs

OOO scalability research
  - Checkpoint Processing and Recovery [Akkary 03]
    - Attacks scaling of register file
    - Take checkpoints at rename
    - Only recover to those
    - Free pregs aggressively
  - Continual Flow Pipelines [Srinivasan 04]
    - Attacks scaling of issue queue
    - Take L2 misses and their dependents out of the IQ
    - Place them back in when the L2 miss returns
  - Store Vulnerability Window [Roth 05] + Store Queue Index Prediction [Sha 05]
    - Scalable (non-associative) load/store queues
    - Predict store queue index for forwarding
    - Filtered load re-execution prior to commit

Out of Order: Benefits
  - Allows speculative re-ordering
    - Loads / stores
    - Branch prediction
  - Schedule can change due to cache misses
    - The optimal schedule differs from the one for a cache hit
  - Done by hardware
    - Compiler may want a different schedule for different hw configs
    - Hardware has only its own configuration to deal with
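The simple squash-history predictor from the predictive load scheduling slide can be sketched as a set of load PCs. The class and method names are hypothetical, and a real table would be finite and indexed by a hash of the PC rather than being an unbounded set.

```python
class LoadWaitPredictor:
    """Loads issue aggressively until they cause a squash, then wait."""

    def __init__(self):
        self.must_wait = set()  # PCs of loads that have caused a squash

    def may_issue(self, load_pc, older_stores_pending):
        """Decide whether a load whose inputs are ready may issue now."""
        if load_pc in self.must_wait:
            # Known offender: schedule conservatively, behind all older stores.
            return not older_stores_pending
        return True  # default: aggressive

    def on_squash(self, load_pc):
        """Record a load that was caught violating store ordering."""
        self.must_wait.add(load_pc)


pred = LoadWaitPredictor()
before = pred.may_issue(0x400, older_stores_pending=True)  # aggressive
pred.on_squash(0x400)                                      # it mis-speculated
after = pred.may_issue(0x400, older_stores_pending=True)   # now it waits
print(before, after)  # True False
```

This shows the listed drawback as well: once marked, the load waits for every older store, not only the one it actually conflicts with.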

Dependence types
  - RAW (Read After Write) = true dependence
      ld [r] -> r
      add r + r -> r
  - WAW (Write After Write) = output dependence
      ld [r] -> r
      add r + r -> r
  - WAR (Write After Read) = anti-dependence
      ld [r] -> r
      add r + r -> r

Memory dependences
  - RAW (Read After Write)
      st r -> [r]
      ld [r] -> r
  - WAW (Write After Write)
      st r -> [r]
      st r -> [r]
  - WAR (Write After Read)
      ld [r] -> r
      st r -> [r]

More on dependences
  - RAW
    - When more than one applies, RAW dominates:
        add r + r -> r
        addi r + -> r
    - Must be respected: no trick to avoid
  - WAR/WAW on registers
    - Two things happen to use the same name
    - Can be eliminated by renaming
  - WAR/WAW on memory
    - Can't rename memory
    - Need to use other tricks (later this lecture)

Out of Order: Top 5 things to know
  - Register renaming
    - How to perform it and how to recover it
  - Commit
    - Precise state (ROB)
    - How/when registers are freed
  - Issue/Select
    - Wakeup: CAM
    - Choose N oldest ready instructions
  - Stores
    - Write at commit
    - Forward to loads via SQ
  - Loads
    - Conservative/aggressive/predictive scheduling
    - Violation detection
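The three register dependence types can be checked mechanically for a pair of instructions. In this sketch an instruction is a hypothetical (sources, destination) pair of register names, and the register numbers are invented for the example.

```python
def dependences(first, second):
    """Classify register dependences from an earlier to a later instruction."""
    srcs1, dst1 = first
    srcs2, dst2 = second
    found = set()
    if dst1 is not None and dst1 in srcs2:
        found.add("RAW")  # later reads what earlier writes: true dependence
    if dst1 is not None and dst1 == dst2:
        found.add("WAW")  # both write the same register: output dependence
    if dst2 is not None and dst2 in srcs1:
        found.add("WAR")  # later overwrites what earlier reads: anti-dependence
    return found


ld = (["r1"], "r2")  # ld [r1] -> r2
print(dependences(ld, (["r2", "r3"], "r4")))  # {'RAW'}
print(dependences(ld, (["r3", "r4"], "r2")))  # {'WAW'}
print(dependences(ld, (["r3", "r4"], "r1")))  # {'WAR'}

# add r1 + r2 -> r3 followed by addi r3 + 1 -> r3: RAW and WAW both apply.
both = dependences((["r1", "r2"], "r3"), (["r3"], "r3"))
```

Renaming gives the second write a fresh physical register, which removes the WAW and WAR cases, but the RAW dependence remains and must be respected.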
