This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example


This Unit: Scheduling (Static + Dynamic)
CIS 501 Computer Architecture, Unit 8: Static and Dynamic Scheduling

[Figure: system layer stack: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors]

- Previously:
  - Pipelining: multiple stages, different instructions in each stage
  - Superscalar: multiple instructions in each stage (N-wide)
- Now:
  - Compiler (static) scheduling
  - Hardware (dynamic) scheduling

Readings
- H+P: none (not happy with explanation of this topic)
- Papers: Alpha 21264, due today
- Discussion: Alpha 21264, due next week

Review Example

    loop: ld [r] -> r
          add r + r -> r
          st r -> [r]
          addi r + 4 -> r
          sub r, -> r6
          jnz r6, loop

On a single-issue, 5-stage pipeline, how many cycles does each loop iteration take? Assume all cache hits and perfect branch prediction.

[Pipeline diagram: stages F D X M W per instruction, with one data stall (d*) and one propagated stall (p*)]

IPC = 6/7 = 0.86

CIS 501: Scheduling

Review Example (wider pipeline)

What if the pipeline is 2-wide?

[Pipeline diagram: the same loop issued two instructions per cycle; dependent instructions still stall (d*, p*)]

IPC = 6/6 = 1.0, a 16% speedup. Would we get any more performance by going 4-wide?

Un-optimized code

    do { *p = *p + x; p++; } while (p != &a[n]);

    loop: ld [r] -> r
          add r + r -> r
          st r -> [r]
          addi r + 4 -> r
          sub r, -> r6
          jnz r6, loop

The compiler has taken the code and just emitted assembly in the order the statements appear. Can we do better?

Scheduling Code

[Figure: dataflow graph of the loop body]

- The compiler can re-order instructions
  - Eliminate RAW stalls
  - Place independent instructions near each other
- Called static scheduling
- Must be careful to preserve program behavior in all cases

    ld [r] -> r
    add r + r -> r
    addi r + 4 -> r
    st r -> [r]
    sub r, -> r6
    jne r6, loop

Optimization

    ld [r] -> r
    addi r + 4 -> r
    add r + r -> r
    sub r, -> r6
    st r -> [r]
    jne r6, loop

A problem with that

Now st r -> [r] gets the wrong value: addi has already updated the register the store uses as its address.

Fixed

    ld [r] -> r
    addi r + 4 -> r
    add r + r -> r
    sub r, -> r6
    st r -> -4[r]
    jne r6, loop

Optimized code

    do { *p = *p + x; p++; } while (p != &a[n]);

    loop: ld [r] -> r
          addi r + 4 -> r
          add r + r -> r
          sub r, -> r6
          st r -> -4[r]
          jnz r6, loop

Now pieces of each statement are interleaved. (Aside: this is why debugging optimized code is confusing.)

How fast now?

    loop: ld [r] -> r
          addi r + 4 -> r
          add r + r -> r
          sub r, -> r6
          st r -> -4[r]
          jnz r6, loop

How fast is this on a 1-wide machine?

[Pipeline diagram: six instructions, no stalls]

IPC = 6/6 = 1.0, as fast as the 2-wide machine running the unoptimized code.

How fast is this on a 2-wide machine?

[Pipeline diagram: two instructions per cycle, one data stall (d*) with propagated stalls (p*)]

IPC = 6/4 = 1.5, a 50% speedup. What about 4-wide? 8? 16?

Performance vs width

[Chart: millions of cycles for M iterations against machine width; series: Unoptimized, Scheduled]

Room to improve?
- Code is much better
  - 2-wide performance greatly improved
  - 4-wide now useful
- Can we do better?
  - With the scheduling scope shown: no
  - Larger scheduling scope: yes

Scheduling Scope
- Window of instructions we can re-order in
- Larger => better schedules
- Compiler: theoretically the whole program
  - Not practical for many reasons
- How?
  - One way: loop unrolling
  - Others exist: not in the scope of this class

Loop un-rolling
- Take 2 (or more) iterations
- Remove extra loop control
  - Getting rid of extra instructions saves time!
  - Note: must compare cycles, not IPC, now
- Re-schedule both together
  - Larger scope to schedule from
  - Register names may need changing

Loop unrolling

Two copies of the loop body, with loop control kept only once (the pointer now advances by 8):

    ld [r] -> r
    add r + r -> r
    st r -> [r]
    ld [r] -> r
    add r + r -> r
    st r -> [r]
    addi r + 8 -> r
    sub r, -> r6
    jnz r6, loop

Re-scheduled, with the second copy renamed to r7:

    ld [r] -> r
    addi r + 8 -> r
    ld -4[r] -> r7
    sub r, -> r6
    add r + r -> r
    add r7 + r -> r7
    st r -> -8[r]
    st r7 -> -4[r]
    jnz r6, loop

Performance vs width

[Chart: millions of cycles for M iterations against machine width; series: Unoptimized, Scheduled, Unrolled 2x, Unrolled 4x, Unrolled 8x]
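The saving from removing extra loop control can be checked with a little arithmetic. The sketch below is only an illustration: it assumes, reading off the slide's loop, 3 instructions of real work per element (ld, add, st) and 3 of loop control per trip (addi, sub, jnz). The function name and parameters are hypothetical.

```python
def dynamic_instructions(n_elements, unroll, body=3, overhead=3):
    """Count dynamic instructions for the array-update loop.

    body: instructions per element (ld, add, st on the slide).
    overhead: loop-control instructions per trip (addi, sub, jnz).
    Assumes the unroll factor divides n_elements evenly.
    """
    if unroll < 1 or n_elements % unroll != 0:
        raise ValueError("unroll must be >= 1 and divide n_elements")
    trips = n_elements // unroll
    return trips * (body * unroll + overhead)


if __name__ == "__main__":
    for factor in (1, 2, 4, 8):
        print(factor, dynamic_instructions(1000, factor))
```

Under these assumptions the 2x-unrolled loop executes 9 instructions per two elements instead of 12, which is why cycle counts rather than IPC are the right comparison.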

Why not unroll K times?
- More unrolling => more performance
  - Fewer dynamic instructions
  - Better scheduling
- Downsides / limiting factors?
  - Number of registers
  - More static instructions => I$ pressure

Limitations of static scheduling
- Assumes cache hits
  - Common case
  - Miss? A different schedule may be better
- Compiler must be conservative
  - Needs to guarantee correctness
  - Sometimes tough to tell if re-ordering is legal

Re-ordering barrier: branches

    loop: jz r, not_found
          ld [r] -> r
          sub r, r -> r
          jz r, found
          ld [r] -> r
          jmp loop

Legal to move the load up? No: if r is null, it will cause a fault.

Re-ordering barrier: ld/st

    ld [r] -> r
    st r -> [r]

Can these be switched? No: the two address registers may hold the same value.
- Technical term: alias
- Two names for the same memory location

An example

    void f(int *a, int *b, int *c, int N) {
        for (int i = 0; i < N; i++) {
            a[i] = b[i] + c[i];
        }
    }

Loop body:

    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]

Loop unrolled 2x

    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]
    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]
    // add 8 to r, r, and r (loop control here)

What can we re-order here?

Can we move the second iteration's first load up, above the first store? No: r + 4 might equal r, so the load might read the location the store writes.

Aliasing problems
- Must be conservative
  - f(ptr+1, ptr, ptr) is not the common case
  - but it is possible
- If only we could speculate...
  - Allow re-ordering in the common case
  - Get correctness in the rare case
- Anything software can do, hardware can do better...
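To see why the compiler must be conservative, the following sketch models memory as a Python list and the pointers as integer offsets into it. Both function names and the memory model are assumptions made for illustration; the second function hoists the second iteration's loads above the first store, which is the re-ordering the slide asks about.

```python
def f_in_order(mem, a, b, c, n):
    """a[i] = b[i] + c[i], with loads and stores in program order."""
    for i in range(n):
        mem[a + i] = mem[b + i] + mem[c + i]


def f_loads_hoisted(mem, a, b, c, n):
    """Unrolled by 2 (n assumed even); second iteration's loads moved up."""
    for i in range(0, n, 2):
        x0, y0 = mem[b + i], mem[c + i]
        x1, y1 = mem[b + i + 1], mem[c + i + 1]  # hoisted above the store below
        mem[a + i] = x0 + y0
        mem[a + i + 1] = x1 + y1


if __name__ == "__main__":
    # Aliasing call in the style of f(ptr+1, ptr, ptr): a overlaps b and c.
    m1, m2 = [1, 5, 0, 0], [1, 5, 0, 0]
    f_in_order(m1, 1, 0, 0, 2)
    f_loads_hoisted(m2, 1, 0, 0, 2)
    print(m1, m2)  # the two results differ
```

With non-overlapping arrays the two versions agree, so the hoisted schedule is only wrong in the rare aliasing case, which is the situation speculation is meant to handle.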

Out-of-order execution
- Hardware can speculate
  - Load/store ordering
  - Branches
- D$ misses?
  - Compiler: no idea
  - Hardware: knows when they happen
- Out-of-order execution
  - Aka dynamic scheduling

Out-of-order execution
- Execute out of program order
  - Execute the oldest ready instruction
  - Ready: all input values available
- Reduce RAW stalls
- Retain the appearance of in-order execution
  - Maintain correctness

Out-of-order pipeline

    Fetch -> Decode -> Rename -> Dispatch -> [buffer of instructions] -> Issue -> Reg-read -> Execute -> Writeback -> Commit

- In-order front end: Fetch through Dispatch
- Out-of-order execution: Issue through Writeback

Register renaming
- Recall static scheduling:

    xor r ^ r -> r
    sub - r -> r
    addi r + -> r

- sub/add can be re-ordered
- Must change the destination register of sub:

    xor r ^ r -> r
    sub - r -> r7
    addi r7 + -> r

Register renaming
- Same principle applies to hardware
  - Might re-order anything
  - Create unique names
- Logical registers => physical registers
- Map table: holds the translation
  - Indexed by logical register
  - Holds physical register numbers

Register renaming steps
- Read input numbers from the map table
- Allocate a new physical register
  - None available? => stall
- Update the map table with the destination register

Renaming example

    xor r ^ r -> r
    sub - r -> r
    addi r + -> r

[Animation frames: the map table (logical register -> physical register) and the free list, with p10 as the last free register shown, before and after the xor is renamed]
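The three renaming steps can be sketched in a few lines. This is an illustrative model, not the hardware design: the register numbers in the usage example are hypothetical (the slide's own numbers are not reproduced here), and a stall is modeled by raising an exception.

```python
def rename(insns, map_table, free_list):
    """Rename a list of (op, source_regs, dest_reg) tuples in program order.

    map_table: dict logical register -> physical register (updated in place).
    free_list: list of free physical registers (consumed from the front).
    """
    renamed = []
    for op, srcs, dst in insns:
        psrcs = [map_table[s] for s in srcs]  # step 1: read inputs from map table
        if not free_list:
            raise RuntimeError("no free physical register: rename would stall")
        pdst = free_list.pop(0)               # step 2: allocate new physical register
        map_table[dst] = pdst                 # step 3: update map table
        renamed.append((op, psrcs, pdst))
    return renamed


if __name__ == "__main__":
    table = {"r1": "p1", "r2": "p2", "r3": "p3", "r4": "p4", "r5": "p5"}
    free = ["p6", "p7", "p8", "p9"]
    code = [("xor", ["r1", "r2"], "r3"), ("add", ["r3", "r4"], "r4"),
            ("sub", ["r5", "r2"], "r3"), ("addi", ["r3"], "r1")]
    for insn in rename(code, table, free):
        print(insn)
```

Note that the inputs are read before the destination is updated, so an instruction that reads and writes the same logical register sees the old mapping for its source.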

Renaming example (continued)

[Animation frames: xor, add, sub and addi are renamed one at a time. For each instruction the sources are read from the map table, a physical register is taken from the free list for the destination, and the map table is updated.]

Out-of-order pipeline

    Fetch -> Decode -> Rename -> Dispatch -> [buffer of instructions] -> Issue -> Reg-read -> Execute -> Writeback -> Commit

- Have unique register names
- Now put into ooo execution structures

Dispatch
- Renamed instructions go into ooo structures
- Re-order buffer (ROB)
  - All instructions until commit
- Issue Queue
  - Un-executed instructions
  - Central piece of scheduling logic
  - Content Addressable Memory (CAM)

RAM vs CAM
- Random Access Memory
  - Read/write a specific index
  - Get/set the value there
- Content Addressable Memory
  - Search for a value
  - Find matching indices
- One structure can have ports of both types

[Figure: RAM: read/write a specific index. CAM: search for a value and return the indices that hold it.]
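The difference between the two port types can be shown with a toy model. The class and method names below are hypothetical, and a real CAM compares all entries in parallel rather than looping.

```python
class SmallMemory:
    """A tiny array with both a RAM-style port and a CAM-style port."""

    def __init__(self, values):
        self.cells = list(values)

    def ram_read(self, index):
        # RAM port: get the value stored at a specific index.
        return self.cells[index]

    def ram_write(self, index, value):
        # RAM port: set the value stored at a specific index.
        self.cells[index] = value

    def cam_search(self, value):
        # CAM port: return every index whose content matches the value.
        return [i for i, v in enumerate(self.cells) if v == value]


if __name__ == "__main__":
    m = SmallMemory([7, 9, 7, 3])
    print(m.ram_read(1), m.cam_search(7))
```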

Issue Queue
- Holds un-executed instructions
- Tracks ready inputs
  - Physical register names + ready bit
  - AND the ready bits to tell if the instruction is ready

    Columns: Insn | Inp1 | R | Inp2 | R | Dst | Age | Ready?

Dispatch Steps
- Allocate an IQ slot
  - Full? Stall
- Read the ready bits of the inputs
  - Table with 1 bit per preg
- Clear the ready bit of the output in the table
  - Instruction has not produced its value yet
- Write the data into the IQ slot

Dispatch Example

[Animation frames: the renamed xor, add, sub and addi are dispatched one at a time into the issue queue with ages 0 to 3. As each is dispatched, its destination is marked not ready in the ready-bit table, so add and addi enter the queue with one input not ready.]

Out-of-order pipeline
- Execution (ooo) stages: Issue, Reg-read, Execute, Writeback
- Select ready instructions
  - Send for execution
- Wakeup dependents

Issue = Select + Wakeup
- Select the N oldest ready instructions
  - N = 1? xor
  - N >= 2? xor and sub
  - Note: may have resource constraints, e.g. ld/st/fp
- Wakeup dependent instructions
  - CAM search for Dst in the inputs
  - Set ready
  - Also update the ready-bit table for future instructions

[Issue queue snapshots: xor and sub are ready and selected; after wakeup, the inputs of add and addi become ready]

Issue
- Select/Wakeup in one cycle
- Dependents go back to back
  - Now add/addi are ready

Register Read
- Not done at decode
- Must read the physical register (renamed)
- Must be done when the value is ready
  - Or gone through when expecting a bypass
- Physical register file may be large
  - Multi-cycle read
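Select and wakeup can be sketched as two small functions over a list of issue-queue entries. This is a simplified model under stated assumptions: the entry layout, the function names and the physical register numbers are hypothetical, wakeup happens in the same cycle as select, and resource constraints are ignored.

```python
from dataclasses import dataclass


@dataclass
class IQEntry:
    name: str
    srcs: dict  # physical source register -> ready bit
    dst: str
    age: int


def select(queue, width):
    """Pick up to `width` oldest entries whose inputs are all ready."""
    ready = [e for e in queue if all(e.srcs.values())]
    ready.sort(key=lambda e: e.age)
    return ready[:width]


def wakeup(queue, issued, ready_table):
    """Broadcast each issued destination; matching inputs become ready."""
    for done in issued:
        ready_table[done.dst] = True  # for instructions dispatched later
        for e in queue:
            if done.dst in e.srcs:    # stands in for the CAM search
                e.srcs[done.dst] = True


def issue_cycle(queue, width, ready_table):
    issued = select(queue, width)
    for e in issued:
        queue.remove(e)
    wakeup(queue, issued, ready_table)
    return [e.name for e in issued]


if __name__ == "__main__":
    table = {}
    q = [IQEntry("xor", {"p1": True, "p2": True}, "p6", 0),
         IQEntry("add", {"p6": False, "p4": True}, "p7", 1),
         IQEntry("sub", {"p5": True, "p2": True}, "p8", 2),
         IQEntry("addi", {"p8": False}, "p9", 3)]
    print(issue_cycle(q, 2, table))
    print(issue_cycle(q, 2, table))
```

With a width of 2 the first cycle picks xor and sub, and their wakeups let add and addi issue back to back in the next cycle, matching the slide's narrative.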

Renaming review

Everyone rename this instruction:

    mul r * -> r

[Figure: current map table and free list]

Dispatch Review

Everyone dispatch this instruction:

    div / -> p

[Figure: issue queue (Insn | Inp1 | R | Inp2 | R | Dst | Age) and ready-bit table]

Select Review

[Issue queue holding add (age 0), mul, div and xor; mul and div each have one input that is not ready]

Determine which instructions are ready. Which will be issued on a 1-wide machine? Which will be issued on a 2-wide machine?

Wakeup Review

[Same issue queue contents]

What information will change if we issue the add?

OOO execution (2-wide)

[Animation frames: xor and sub are ready and issue first, read the physical register file and execute; add and addi wake up, issue in the following cycle and receive their inputs from xor and sub; results are written back to the physical registers.]

Note the similarity to in-order execution.

Multi-cycle operations
- Multi-cycle ops (ld, fp, mul, etc.)
  - Wakeup deferred a few cycles
  - Structural hazard?
- Cache misses?
  - Speculative wake-up (assume hit)
  - Cancel execution of dependents
  - Re-issue later
  - Details: complex, not important

Re-order Buffer (ROB)
- All instructions, in order
- Purposes
  - Misprediction recovery
  - In-order commit
    - Maintain the appearance of in-order execution
    - Freeing of physical registers

Renaming revisited
- Overwritten register
  - Freed at commit
  - Restored in the map table on recovery
  - Also must be read at rename

Renaming example

    xor r ^ r -> r
    sub - r -> r
    addi r + -> r

[Animation frames: the same renaming example, now also recording in brackets the physical register that each destination overwrites, e.g. xor p ^ p -> [ p ]]

Renaming example (continued)

[Animation frames: xor, add, sub and addi are renamed in turn; each renamed instruction carries its overwritten physical register in brackets alongside the updated map table and free list.]

ROB
- A ROB entry holds all the info needed for recovery/commit
  - Logical register names
  - Physical register names
  - Instruction types
- Dispatch: insert at the tail
  - Full? Stall
- Commit: remove from the head
  - Not completed? Stall
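The ROB's dispatch-at-tail and commit-from-head behavior is a bounded FIFO. The sketch below is a minimal model with hypothetical names; a real entry would also hold the logical and physical register names listed above.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # head is on the left, tail on the right

    def dispatch(self, insn):
        """Insert at the tail; return False (stall) when the ROB is full."""
        if len(self.entries) == self.size:
            return False
        self.entries.append({"insn": insn, "done": False})
        return True

    def complete(self, insn):
        """Mark an instruction as finished executing (may be out of order)."""
        for e in self.entries:
            if e["insn"] == insn:
                e["done"] = True

    def commit(self, width):
        """Remove up to `width` completed instructions from the head, in order."""
        committed = []
        while self.entries and self.entries[0]["done"] and len(committed) < width:
            committed.append(self.entries.popleft()["insn"])
        return committed


if __name__ == "__main__":
    rob = ReorderBuffer(3)
    for name in ("a", "b", "c"):
        rob.dispatch(name)
    rob.complete("b")
    print(rob.commit(2))  # head "a" is not done, so nothing commits
```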

Recovery
- Completely remove wrong-path instructions
  - Flush from the IQ
  - Remove from the ROB
  - Restore the map table to its state before the misprediction
  - Free the destination registers

Recovery example

    bnz r loop
    xor r ^ r -> r
    sub - r -> r
    addi r + -> r

[Animation frames: the branch is mispredicted. Starting from the youngest instruction, addi, then sub, add and xor are removed from the ROB one at a time; each removal restores the bracketed overwritten register into the map table and returns the destination register to the free list, until only the bnz remains.]
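The recovery walk shown in the example can be sketched as popping ROB entries from the tail back to the mispredicted branch. The entry layout, function name and register numbers below are hypothetical, and flushing the issue queue is left out.

```python
def recover(rob, map_table, free_list, branch_index):
    """Squash every ROB entry younger than the branch at rob[branch_index].

    rob: list ordered oldest first; each squashed entry needs
         'logical' (destination name), 'new' (allocated physical register)
         and 'old' (the overwritten physical register).
    """
    while len(rob) - 1 > branch_index:
        entry = rob.pop()                            # youngest wrong-path instruction
        map_table[entry["logical"]] = entry["old"]   # restore previous mapping
        free_list.append(entry["new"])               # free its destination register


if __name__ == "__main__":
    rob = [{"insn": "bnz"},
           {"insn": "xor", "logical": "r3", "new": "p6", "old": "p3"},
           {"insn": "add", "logical": "r4", "new": "p7", "old": "p4"},
           {"insn": "sub", "logical": "r3", "new": "p8", "old": "p6"},
           {"insn": "addi", "logical": "r1", "new": "p9", "old": "p1"}]
    table = {"r1": "p9", "r2": "p2", "r3": "p8", "r4": "p7"}
    free = ["p10"]
    recover(rob, table, free, branch_index=0)
    print(table, free)
```

Walking youngest-first matters: r3 is written twice on the wrong path, and undoing sub before xor leaves the map table pointing at the pre-branch register.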

What about stores?
- Stores write the D$, not registers
  - Can we rename memory?
  - Recover in the cache?
- No (at least not easily)
  - Cache writes are unrecoverable
  - Stores: write only when certain
    - At commit

Commit

    xor r ^ r -> r
    sub - r -> r
    addi r + -> r

- Commit: the instruction becomes architected state
  - In-order, only when instructions are finished
  - Free the overwritten register (why?)

Freeing the over-written register
- The bracketed physical register held xor's destination logical register before the xor
- p6 holds it after the xor
  - Anything older than the xor should read the old register
  - Anything younger than the xor should read p6 (until the next instruction that writes the same logical register)
- At commit of the xor, no older instructions exist, so the old register can be freed

Commit Example

[Animation frame: ROB holding xor, add, sub and addi with their bracketed overwritten registers, plus the free list]

Commit Example (continued)

[Animation frames: xor, add, sub and addi commit from the head of the ROB in order; as each commits, its bracketed overwritten register is appended to the free list.]

Out of order pipeline diagrams
- Standard style: large and cumbersome
- Change the layout slightly
  - Columns = stages (dispatch, issue, etc.)
  - Rows = instructions
  - Content of boxes = cycles
- For our purposes: issue/exec = 1 cycle
  - Ignore preg read latency, etc.
  - Load-use, mul, div, and FP take longer

Example (2-wide; infinite ROB, IQ, pregs; loads take 3 cycles)

    Columns: Instruction | Disp | Issue | WB | Commit
    ld [p] -> p
    add p + p -> p
    xor p ^ -> 
    ld [] -> 

- Cycle 1: dispatch ld and add
- Cycle 2: dispatch xor and ld; the 1st ld issues. Also note its WB cycle (5) while you do this. (Note: don't issue if the WB ports are full.)
- Cycle 3: add and xor are not ready; the 2nd load is, so issue it (WB in cycle 6)
- Cycle 4: nothing
- Cycle 5: add can issue
- Cycle 6: the 1st load can commit (oldest instruction and finished); xor can issue
- Cycle 7: add can commit
- Cycle 8: commit xor and ld (2-wide: can do both at once)

Loads and stores

    Columns: Instruction | Disp | Issue | WB | Commit
    fdiv p / p -> p
    st p -> [ ]
    st p -> [ ]
    ld [ ] -> 

- Can ld [ ] -> execute before the older stores? Why or why not?
- Aliasing (again): the load address may or may not match either store address
- Suppose the addresses are known and do not match. Can the ld execute now?

Forwarding
- Stores write the cache at commit
  - Commit is in-order, delayed by all earlier instructions
- Loads read the cache
  - But their execution is critical
- Forwarding
  - Allow store -> load communication before the store commits
  - Conceptually like bypassing, but a very different implementation

Forwarding: Store Queue
- Store Queue
  - Holds all in-flight stores
  - CAM: searchable by address
  - Age logic: determine the youngest matching store older than the load
- Store execution
  - Write the Store Queue
  - Address + data
- Load execution
  - Search the SQ
  - Match? Forward
  - Otherwise read the D$

[Figure: Store Queue (SQ) with head and tail pointers, address CAM and age logic in parallel with the D$/TLB; the load position selects among matching stores]

Load scheduling
- Store -> load forwarding:
  - Get the value from an executed (but not committed) store to the load
- Load scheduling:
  - Determine when a load can execute with regard to older stores
- Conservative load scheduling:
  - All older stores have executed
  - Some architectures: split store address / store data
    - Only require a known address
  - Advantage: always safe
  - Disadvantage: performance (limits out-of-orderness)

Our example from before

    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]
    ld [r] ->
    ld [r] -> r6
    add + r6 -> r7
    st r7 -> [r]
    // loop control here

With conservative load scheduling, what can go out of order?
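The store queue's address search plus age logic can be modeled as a filter followed by a maximum. In this sketch (all names hypothetical) ages are sequence numbers where a smaller number means an older instruction, and the cache is a plain dictionary.

```python
def load_value(load_age, addr, store_queue, dcache):
    """Return the value a load should see.

    store_queue: list of (age, address, data) for executed, uncommitted stores.
    Forward from the youngest store that is older than the load and matches
    its address; otherwise read the data cache.
    """
    matches = [s for s in store_queue if s[0] < load_age and s[1] == addr]
    if matches:
        return max(matches, key=lambda s: s[0])[2]
    return dcache[addr]


if __name__ == "__main__":
    sq = [(1, 0x10, 11), (3, 0x10, 33), (5, 0x10, 55), (2, 0x20, 22)]
    cache = {0x10: 0, 0x20: 0, 0x30: 99}
    print(load_value(4, 0x10, sq, cache))  # forwards from the store with age 3
```

The store with age 5 is ignored even though its address matches, because it is younger than the load in program order.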

Our example from before (conservative load scheduling)

    ld [p] ->
    ld [p] ->
    add + ->
    st -> [p]
    ld [p] ->
    ld [p] ->
    add + -> p
    st p -> [p]

Suppose 2-wide, conservative scheduling. May issue 1 load per cycle. Loads take 3 cycles to complete.

[Animation frames: the Disp / Issue / WB / Commit table is filled in cycle by cycle. With conservative load scheduling the second iteration's first load, ld [p] ->, cannot issue until the older store has executed.]

Our 2-wide ooo processor may as well be 1-wide in-order!

Our example from before (completed table)

- It would be nice if we could issue the second iteration's ld [p] -> earlier, before the older store has executed
- Can we speculate and issue it then?

Load Speculation
- Speculation requires two things...
  - Detection of mis-speculations
    - How can we do this?
  - Recovery from mis-speculations
    - Squash from the offending load
    - Saw how to squash from branches: same method

Load Queue
- Detects load ordering violations
- Load execution: write the address into the LQ
  - Also note any store forwarded from
- Store execution: search the LQ
  - Younger load with the same address?
  - Didn't forward from a younger store?

[Figure: load queue (LQ) with head and tail pointers and an address CAM, searched from the store position to produce a flush signal]

Store Queue + Load Queue
- Store Queue: handles forwarding
  - Written by stores
  - Searched by loads
- Load Queue: detects ordering violations
  - Written by loads
  - Searched by stores
- Both together
  - Allow aggressive load scheduling
  - Stores don't constrain load execution

[Figure: LQ and SQ, each with head/tail pointers and age logic, alongside the D$/TLB]

Our example from before (aggressive load scheduling)

    ld [p] ->
    ld [p] ->
    add + ->
    st -> [p]
    ld [p] ->
    ld [p] ->
    add + -> p
    st p -> [p]

- Aggressive load scheduling?
  - Issue the second iteration's ld [p] -> without waiting for the older store

[Animation frames: the Disp / Issue / WB / Commit table filled in with the load issued early]

- Saves cycles over conservative scheduling
- Actually uses ooo-ness

Aggressive Load scheduling
- Allows loads to issue before older stores
  - Increases out-of-orderness
  + When there is no conflict, increases performance
  - Conflict => squash => worse performance than waiting
- Some loads might forward from stores
  - Always-aggressive scheduling will squash a lot
- Can we have our cake AND eat it too?

Predictive Load scheduling
- Predict which loads must wait for stores
- Fool me once, shame on you... fool me twice?
  - Loads default to aggressive
  - Keep a table of load PCs that have caused squashes
  - Schedule these conservatively
  + Simple predictor
  - Making "bad" loads wait for all older stores is not so great
- More complex predictors are used in practice
  - Predict which stores a load should wait for

Out of Order: Window Size
- Scheduling scope = ooo window size
  - Larger = better
- Constrained by physical registers
  - ROB roughly limited by #preg = ROB size + #logical registers
  - Big register file = hard/slow
- Constrained by the issue queue
  - Limits the number of un-executed instructions
  - CAM = can't make it big (power + area)
- Constrained by the load + store queues
  - Limit the number of loads/stores
  - CAMs

OOO scalability research
- Checkpoint Processing and Recovery [Akkary 03]
  - Attacks scaling of the register file
  - Take checkpoints at rename
  - Only recover to those
  - Free pregs aggressively
- Continual Flow Pipelines [Srinivasan 04]
  - Attacks scaling of the issue queue
  - Put L2 misses and their dependents out of the IQ
  - Place them back in when the L2 miss returns
- Store Vulnerability Window [Roth 05] + Store Queue Index Prediction [Sha 05]
  - Scalable (non-associative) load/store queues
  - Predict the store queue index for forwarding
  - Filtered load re-execution prior to commit

Out of Order: Benefits
- Allows speculative re-ordering
  - Loads / stores
  - Branch prediction
- Schedule can change due to cache misses
  - Different schedule is optimal than on a cache hit
- Done by hardware
  - Compiler may want a different schedule for different hw configs
  - Hardware has only its own configuration to deal with
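The simple predictor described above, a table of load PCs that have caused squashes, can be sketched in a few lines. The class and method names are hypothetical, and an unbounded set stands in for what would be a finite hardware table.

```python
class LoadWaitTable:
    """Loads default to aggressive; loads that caused a squash wait next time."""

    def __init__(self):
        self.squashed_pcs = set()

    def policy(self, load_pc):
        # Conservative: wait for all older stores. Aggressive: issue when ready.
        return "conservative" if load_pc in self.squashed_pcs else "aggressive"

    def record_squash(self, load_pc):
        # Called when the load queue detects an ordering violation by this load.
        self.squashed_pcs.add(load_pc)


if __name__ == "__main__":
    table = LoadWaitTable()
    print(table.policy(0x400))
    table.record_squash(0x400)
    print(table.policy(0x400))
```

This captures the stated weakness as well: once a load is in the table it waits for all older stores, not only the one it actually conflicts with.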

Dependence types
- RAW (Read After Write) = true dependence

    ld [r] -> r
    add r + r -> r

- WAW (Write After Write) = output dependence

    ld [r] -> r
    add r + r -> r

- WAR (Write After Read) = anti-dependence

    ld [r] -> r
    add r + r -> r

Memory dependences
- RAW (Read After Write)

    st r -> [r]
    ld [r] -> r

- WAW (Write After Write)

    st r -> [r]
    st r -> [r]

- WAR (Write After Read)

    ld [r] -> r
    st r -> [r]

More on dependences
- RAW
  - When more than one applies, RAW dominates:

    add r + r -> r
    addi r + -> r

  - Must be respected: no trick to avoid it
- WAR/WAW on registers
  - Two things happen to use the same name
  - Can be eliminated by renaming
- WAR/WAW on memory
  - Can't rename memory
  - Need to use other tricks (later this lecture)

Out of Order: Top 5 things to know
- Register renaming
  - How to perform it and how to recover it
- Commit
  - Precise state (ROB)
  - How/when registers are freed
- Issue/Select
  - Wakeup: CAM
  - Choose the N oldest ready instructions
- Stores
  - Write at commit
  - Forward to loads via the SQ
- Loads
  - Conservative/aggressive/predictive scheduling
  - Violation detection
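The three register dependence types listed under "Dependence types" can be told apart by intersecting the read and write sets of two instructions. The sketch below uses hypothetical register names and a made-up function name; it reports every type that applies, in the order RAW, WAW, WAR.

```python
def dependences(first, second):
    """Classify dependences from `first` to a later instruction `second`.

    Each instruction is a pair (registers_read, registers_written) of sets.
    """
    reads1, writes1 = first
    reads2, writes2 = second
    found = []
    if writes1 & reads2:
        found.append("RAW")  # true dependence: second reads what first wrote
    if writes1 & writes2:
        found.append("WAW")  # output dependence: both write the same name
    if reads1 & writes2:
        found.append("WAR")  # anti-dependence: second overwrites first's input
    return found


if __name__ == "__main__":
    ld = ({"r1"}, {"r2"})                       # ld [r1] -> r2
    print(dependences(ld, ({"r2", "r3"}, {"r4"})))
    print(dependences(ld, ({"r3", "r4"}, {"r2"})))
    print(dependences(ld, ({"r3", "r4"}, {"r1"})))
```

When the result contains RAW together with another type, the pair cannot be re-ordered even after renaming, which is the sense in which RAW dominates.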


More information

COVER SHEET: Problem#: Points

COVER SHEET: Problem#: Points EEL 4712 Midterm 3 Spring 2017 VERSION 1 Name: UFID: Sign here to give permission for your test to be returned in class, where others might see your score: IMPORTANT: Please be neat and write (or draw)

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

CPSC 3300 Spring 2017 Exam 2

CPSC 3300 Spring 2017 Exam 2 CPSC 3300 Spring 2017 Exam 2 Name: 1. Matching. Write the correct term from the list into each blank. (2 pts. each) structural hazard EPIC forwarding precise exception hardwired load-use data hazard VLIW

More information

ENEE350 Lecture Notes-Weeks 14 and 15

ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining & Amdahl s Law ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining is a method of processing in which a problem is divided into a number of sub problems and solved and the solu8ons of the sub problems

More information

Portland State University ECE 587/687. Branch Prediction

Portland State University ECE 587/687. Branch Prediction Portland State University ECE 587/687 Branch Prediction Copyright by Alaa Alameldeen and Haitham Akkary 2015 Branch Penalty Example: Comparing perfect branch prediction to 90%, 95%, 99% prediction accuracy,

More information

Project Two RISC Processor Implementation ECE 485

Project Two RISC Processor Implementation ECE 485 Project Two RISC Processor Implementation ECE 485 Chenqi Bao Peter Chinetti November 6, 2013 Instructor: Professor Borkar 1 Statement of Problem This project requires the design and test of a RISC processor

More information

Performance, Power & Energy

Performance, Power & Energy Recall: Goal of this class Performance, Power & Energy ELE8106/ELE6102 Performance Reconfiguration Power/ Energy Spring 2010 Hayden Kwok-Hay So H. So, Sp10 Lecture 3 - ELE8106/6102 2 What is good performance?

More information

ECE 250 / CPS 250 Computer Architecture. Basics of Logic Design Boolean Algebra, Logic Gates

ECE 250 / CPS 250 Computer Architecture. Basics of Logic Design Boolean Algebra, Logic Gates ECE 250 / CPS 250 Computer Architecture Basics of Logic Design Boolean Algebra, Logic Gates Benjamin Lee Slides based on those from Andrew Hilton (Duke), Alvy Lebeck (Duke) Benjamin Lee (Duke), and Amir

More information

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished? 1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished? 2. (2 )What are the two main ways to define performance? 3. (2 )What

More information

Saturday, April 23, Dependence Analysis

Saturday, April 23, Dependence Analysis Dependence Analysis Motivating question Can the loops on the right be run in parallel? i.e., can different processors run different iterations in parallel? What needs to be true for a loop to be parallelizable?

More information

Microprocessor Power Analysis by Labeled Simulation

Microprocessor Power Analysis by Labeled Simulation Microprocessor Power Analysis by Labeled Simulation Cheng-Ta Hsieh, Kevin Chen and Massoud Pedram University of Southern California Dept. of EE-Systems Los Angeles CA 989 Outline! Introduction! Problem

More information

Compiling Techniques

Compiling Techniques Lecture 11: Introduction to 13 November 2015 Table of contents 1 Introduction Overview The Backend The Big Picture 2 Code Shape Overview Introduction Overview The Backend The Big Picture Source code FrontEnd

More information

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Anoop Bhagyanath and Klaus Schneider Embedded Systems Chair University of Kaiserslautern

More information

Special Nodes for Interface

Special Nodes for Interface fi fi Special Nodes for Interface SW on processors Chip-level HW Board-level HW fi fi C code VHDL VHDL code retargetable compilation high-level synthesis SW costs HW costs partitioning (solve ILP) cluster

More information

Computer Architecture ELEC2401 & ELEC3441

Computer Architecture ELEC2401 & ELEC3441 Last Time Pipeline Hazard Computer Architecture ELEC2401 & ELEC3441 Lecture 8 Pipelining (3) Dr. Hayden Kwok-Hay So Department of Electrical and Electronic Engineering Structural Hazard Hazard Control

More information

Outcomes. Spiral 1 / Unit 2. Boolean Algebra BOOLEAN ALGEBRA INTRO. Basic Boolean Algebra Logic Functions Decoders Multiplexers

Outcomes. Spiral 1 / Unit 2. Boolean Algebra BOOLEAN ALGEBRA INTRO. Basic Boolean Algebra Logic Functions Decoders Multiplexers -2. -2.2 piral / Unit 2 Basic Boolean Algebra Logic Functions Decoders Multipleers Mark Redekopp Outcomes I know the difference between combinational and sequential logic and can name eamples of each.

More information

Lecture 3, Performance

Lecture 3, Performance Lecture 3, Performance Repeating some definitions: CPI Clocks Per Instruction MHz megahertz, millions of cycles per second MIPS Millions of Instructions Per Second = MHz / CPI MOPS Millions of Operations

More information

ECE/CS 250 Computer Architecture

ECE/CS 250 Computer Architecture ECE/CS 250 Computer Architecture Basics of Logic Design: Boolean Algebra, Logic Gates (Combinational Logic) Tyler Bletsch Duke University Slides are derived from work by Daniel J. Sorin (Duke), Alvy Lebeck

More information

4. (3) What do we mean when we say something is an N-operand machine?

4. (3) What do we mean when we say something is an N-operand machine? 1. (2) What are the two main ways to define performance? 2. (2) When dealing with control hazards, a prediction is not enough - what else is necessary in order to eliminate stalls? 3. (3) What is an "unbalanced"

More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

Lecture 3, Performance

Lecture 3, Performance Repeating some definitions: Lecture 3, Performance CPI MHz MIPS MOPS Clocks Per Instruction megahertz, millions of cycles per second Millions of Instructions Per Second = MHz / CPI Millions of Operations

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 02, 03 May 2016 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 53 Most Essential Assumptions for Real-Time Systems Upper

More information

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018 ECE 172 Digital Systems Chapter 12 Instruction Pipelining Herbert G. Mayer, PSU Status 7/20/2018 1 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for

More information

ALU A functional unit

ALU A functional unit ALU A functional unit that performs arithmetic operations such as ADD, SUB, MPY logical operations such as AND, OR, XOR, NOT on given data types: 8-,16-,32-, or 64-bit values A n-1 A n-2... A 1 A 0 B n-1

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic Computer Science 324 Computer Architecture Mount Holyoke College Fall 2007 Topic Notes: Digital Logic Our goal for the next few weeks is to paint a a reasonably complete picture of how we can go from transistor

More information

Chapter 7. Sequential Circuits Registers, Counters, RAM

Chapter 7. Sequential Circuits Registers, Counters, RAM Chapter 7. Sequential Circuits Registers, Counters, RAM Register - a group of binary storage elements suitable for holding binary info A group of FFs constitutes a register Commonly used as temporary storage

More information

3. (2) What is the difference between fixed and hybrid instructions?

3. (2) What is the difference between fixed and hybrid instructions? 1. (2 pts) What is a "balanced" pipeline? 2. (2 pts) What are the two main ways to define performance? 3. (2) What is the difference between fixed and hybrid instructions? 4. (2 pts) Clock rates have grown

More information

Fall 2008 CSE Qualifying Exam. September 13, 2008

Fall 2008 CSE Qualifying Exam. September 13, 2008 Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for

More information

Digital Logic. CS211 Computer Architecture. l Topics. l Transistors (Design & Types) l Logic Gates. l Combinational Circuits.

Digital Logic. CS211 Computer Architecture. l Topics. l Transistors (Design & Types) l Logic Gates. l Combinational Circuits. CS211 Computer Architecture Digital Logic l Topics l Transistors (Design & Types) l Logic Gates l Combinational Circuits l K-Maps Figures & Tables borrowed from:! http://www.allaboutcircuits.com/vol_4/index.html!

More information

Menu. Binary Adder EEL3701 EEL3701. Add, Subtract, Compare, ALU

Menu. Binary Adder EEL3701 EEL3701. Add, Subtract, Compare, ALU Other MSI Circuit: Adders >Binar, Half & Full Canonical forms Binar Subtraction Full-Subtractor Magnitude Comparators >See Lam: Fig 4.8 ALU Menu Look into m... 1 Binar Adder Suppose we want to add two

More information

[2] Predicting the direction of a branch is not enough. What else is necessary?

[2] Predicting the direction of a branch is not enough. What else is necessary? [2] When we talk about the number of operands in an instruction (a 1-operand or a 2-operand instruction, for example), what do we mean? [2] What are the two main ways to define performance? [2] Predicting

More information

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control The MIPS Pipeline CSCI206 - Computer Organization & Programming Pipeline Datapath and Control zybook: 11.6 Developed and maintained by the Bucknell University Computer Science Department - 2017 Hazard

More information

[2] Predicting the direction of a branch is not enough. What else is necessary?

[2] Predicting the direction of a branch is not enough. What else is necessary? [2] What are the two main ways to define performance? [2] Predicting the direction of a branch is not enough. What else is necessary? [2] The power consumed by a chip has increased over time, but the clock

More information

ECE/CS 250: Computer Architecture. Basics of Logic Design: Boolean Algebra, Logic Gates. Benjamin Lee

ECE/CS 250: Computer Architecture. Basics of Logic Design: Boolean Algebra, Logic Gates. Benjamin Lee ECE/CS 250: Computer Architecture Basics of Logic Design: Boolean Algebra, Logic Gates Benjamin Lee Slides based on those from Alvin Lebeck, Daniel Sorin, Andrew Hilton, Amir Roth, Gershon Kedem Admin

More information

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University Prof. Mi Lu TA: Ehsan Rohani Laboratory Exercise #4 MIPS Assembly and Simulation

More information

Clock-driven scheduling

Clock-driven scheduling Clock-driven scheduling Also known as static or off-line scheduling Michal Sojka Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Control Engineering November 8, 2017

More information

CSE370 HW3 Solutions (Winter 2010)

CSE370 HW3 Solutions (Winter 2010) CSE370 HW3 Solutions (Winter 2010) 1. CL2e, 4.9 We are asked to implement the function f(a,,c,,e) = A + C + + + CE using the smallest possible multiplexer. We can t use any extra gates or the complement

More information

ww.padasalai.net

ww.padasalai.net t w w ADHITHYA TRB- TET COACHING CENTRE KANCHIPURAM SUNDER MATRIC SCHOOL - 9786851468 TEST - 2 COMPUTER SCIENC PG - TRB DATE : 17. 03. 2019 t et t et t t t t UNIT 1 COMPUTER SYSTEM ARCHITECTURE t t t t

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 19: Adder Design [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411 L19

More information

Performance Prediction Models

Performance Prediction Models Performance Prediction Models Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, Scott Mahlke {shrupad,lukefahr,reetudas,mahlke}@umich.edu Gem5 workshop Micro 2012 December 2, 2012 Composite Cores Pushing

More information

DMP. Deterministic Shared Memory Multiprocessing. Presenter: Wu, Weiyi Yale University

DMP. Deterministic Shared Memory Multiprocessing. Presenter: Wu, Weiyi Yale University DMP Deterministic Shared Memory Multiprocessing 1 Presenter: Wu, Weiyi Yale University Outline What is determinism? How to make execution deterministic? What s the overhead of determinism? 2 What Is Determinism?

More information

CMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining

CMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining CMU 18-447 Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining Instructor: Prof Onur Mutlu TAs: Rachata Ausavarungnirun, Kevin Chang, Albert Cho, Jeremie

More information

Equalities and Uninterpreted Functions. Chapter 3. Decision Procedures. An Algorithmic Point of View. Revision 1.0

Equalities and Uninterpreted Functions. Chapter 3. Decision Procedures. An Algorithmic Point of View. Revision 1.0 Equalities and Uninterpreted Functions Chapter 3 Decision Procedures An Algorithmic Point of View D.Kroening O.Strichman Revision 1.0 Outline Decision Procedures Equalities and Uninterpreted Functions

More information

Measurement & Performance

Measurement & Performance Measurement & Performance Timers Performance measures Time-based metrics Rate-based metrics Benchmarking Amdahl s law Topics 2 Page The Nature of Time real (i.e. wall clock) time = User Time: time spent

More information

Measurement & Performance

Measurement & Performance Measurement & Performance Topics Timers Performance measures Time-based metrics Rate-based metrics Benchmarking Amdahl s law 2 The Nature of Time real (i.e. wall clock) time = User Time: time spent executing

More information

CIS 371 Computer Organization and Design

CIS 371 Computer Organization and Design CIS 371 Computer Organization and Design Unit 7: Caches Based on slides by Prof. Amir Roth & Prof. Milo Martin CIS 371: Comp. Org. Prof. Milo Martin Caches 1 This Unit: Caches I$ Core L2 D$ Main Memory

More information

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach Process Scheduling for RTS Dr. Hugh Melvin, Dept. of IT, NUI,G RTS Scheduling Approach RTS typically control multiple parameters concurrently Eg. Flight Control System Speed, altitude, inclination etc..

More information

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman. SP esign Lecture 7 Unfolding cont. & Folding r. Fredrik Edman fredrik.edman@eit.lth.se Unfolding Unfolding creates a program with more than one iteration, J=unfolding factor Unfolding is a structured way

More information

A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor

A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor Farhad Mehdipour, H. Noori, B. Javadi, H. Honda, K. Inoue, K. Murakami Faculty

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism Raj Parihar Advisor: Prof. Michael C. Huang March 22, 2013 Raj Parihar Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

More information

ECE290 Fall 2012 Lecture 22. Dr. Zbigniew Kalbarczyk

ECE290 Fall 2012 Lecture 22. Dr. Zbigniew Kalbarczyk ECE290 Fall 2012 Lecture 22 Dr. Zbigniew Kalbarczyk Today LC-3 Micro-sequencer (the control store) LC-3 Micro-programmed control memory LC-3 Micro-instruction format LC -3 Micro-sequencer (the circuitry)

More information

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC) Gilles Pokam and Cristiano Pereira (Intel) iacoma.cs.uiuc.edu

More information

Branch Prediction using Advanced Neural Methods

Branch Prediction using Advanced Neural Methods Branch Prediction using Advanced Neural Methods Sunghoon Kim Department of Mechanical Engineering University of California, Berkeley shkim@newton.berkeley.edu Abstract Among the hardware techniques, two-level

More information

Simple Instruction-Pipelining (cont.) Pipelining Jumps

Simple Instruction-Pipelining (cont.) Pipelining Jumps 6.823, L9--1 Simple ruction-pipelining (cont.) + Interrupts Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Src1 ( j / ~j ) Src2 ( / Ind) Pipelining Jumps

More information

Clojure Concurrency Constructs, Part Two. CSCI 5828: Foundations of Software Engineering Lecture 13 10/07/2014

Clojure Concurrency Constructs, Part Two. CSCI 5828: Foundations of Software Engineering Lecture 13 10/07/2014 Clojure Concurrency Constructs, Part Two CSCI 5828: Foundations of Software Engineering Lecture 13 10/07/2014 1 Goals Cover the material presented in Chapter 4, of our concurrency textbook In particular,

More information

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application Administrivia 1. markem/cs333/ 2. Staff 3. Prerequisites 4. Grading Course Objectives 1. Theory and application 2. Benefits 3. Labs TAs Overview 1. What is a computer system? CPU PC ALU System bus Memory

More information

Mark Redekopp, All rights reserved. Lecture 1 Slides. Intro Number Systems Logic Functions

Mark Redekopp, All rights reserved. Lecture 1 Slides. Intro Number Systems Logic Functions Lecture Slides Intro Number Systems Logic Functions EE 0 in Context EE 0 EE 20L Logic Design Fundamentals Logic Design, CAD Tools, Lab tools, Project EE 357 EE 457 Computer Architecture Using the logic

More information

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 9. Datapath Design Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 2, 2017 ECE Department, University of Texas at Austin

More information

Lecture 13: Sequential Circuits, FSM

Lecture 13: Sequential Circuits, FSM Lecture 13: Sequential Circuits, FSM Today s topics: Sequential circuits Finite state machines 1 Clocks A microprocessor is composed of many different circuits that are operating simultaneously if each

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

CSE Computer Architecture I

CSE Computer Architecture I Execution Sequence Summary CSE 30321 Computer Architecture I Lecture 17 - Multi Cycle Control Michael Niemier Department of Computer Science and Engineering Step name Instruction fetch Instruction decode/register

More information

Combinational vs. Sequential. Summary of Combinational Logic. Combinational device/circuit: any circuit built using the basic gates Expressed as

Combinational vs. Sequential. Summary of Combinational Logic. Combinational device/circuit: any circuit built using the basic gates Expressed as Summary of Combinational Logic : Computer Architecture I Instructor: Prof. Bhagi Narahari Dept. of Computer Science Course URL: www.seas.gwu.edu/~bhagiweb/cs3/ Combinational device/circuit: any circuit

More information

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Simple Instruction-Pipelining. Pipelined Harvard Datapath 6.823, L8--1 Simple ruction-pipelining Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. fetch decode & eg-fetch execute

More information

Computer Architecture. ECE 361 Lecture 5: The Design Process & ALU Design. 361 design.1

Computer Architecture. ECE 361 Lecture 5: The Design Process & ALU Design. 361 design.1 Computer Architecture ECE 361 Lecture 5: The Design Process & Design 361 design.1 Quick Review of Last Lecture 361 design.2 MIPS ISA Design Objectives and Implications Support general OS and C- style language

More information

AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE

AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE CARNEGIE MELLON UNIVERSITY AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for the degree

More information

Shannon-Fano-Elias coding

Shannon-Fano-Elias coding Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The

More information

CS 52 Computer rchitecture and Engineering Lecture 4 - Pipelining Krste sanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs52!

More information

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Simple Instruction-Pipelining. Pipelined Harvard Datapath 6.823, L8--1 Simple ruction-pipelining Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. I fetch decode & eg-fetch execute memory Clock period

More information

Department of Electrical and Computer Engineering The University of Texas at Austin

Department of Electrical and Computer Engineering The University of Texas at Austin Department of Electrical and Computer Engineering The University of Texas at Austin EE 360N, Fall 2004 Yale Patt, Instructor Aater Suleman, Huzefa Sanjeliwala, Dam Sunwoo, TAs Exam 1, October 6, 2004 Name:

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 10

ECE 571 Advanced Microprocessor-Based Design Lecture 10 ECE 571 Advanced Microprocessor-Based Design Lecture 10 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 23 February 2017 Announcements HW#5 due HW#6 will be posted 1 Oh No, More

More information

Non-Preemptive and Limited Preemptive Scheduling. LS 12, TU Dortmund

Non-Preemptive and Limited Preemptive Scheduling. LS 12, TU Dortmund Non-Preemptive and Limited Preemptive Scheduling LS 12, TU Dortmund 09 May 2017 (LS 12, TU Dortmund) 1 / 31 Outline Non-Preemptive Scheduling A General View Exact Schedulability Test Pessimistic Schedulability

More information

CSE370: Introduction to Digital Design

CSE370: Introduction to Digital Design CSE370: Introduction to Digital Design Course staff Gaetano Borriello, Brian DeRenzi, Firat Kiyak Course web www.cs.washington.edu/370/ Make sure to subscribe to class mailing list (cse370@cs) Course text

More information

Operational Semantics

Operational Semantics Operational Semantics Semantics and applications to verification Xavier Rival École Normale Supérieure Xavier Rival Operational Semantics 1 / 50 Program of this first lecture Operational semantics Mathematical

More information

Fault Modeling. 李昆忠 Kuen-Jong Lee. Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan. VLSI Testing Class

Fault Modeling. 李昆忠 Kuen-Jong Lee. Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan. VLSI Testing Class Fault Modeling 李昆忠 Kuen-Jong Lee Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan Class Fault Modeling Some Definitions Why Modeling Faults Various Fault Models Fault Detection

More information

Lecture 13: Sequential Circuits, FSM

Lecture 13: Sequential Circuits, FSM Lecture 13: Sequential Circuits, FSM Today s topics: Sequential circuits Finite state machines Reminder: midterm on Tue 2/28 will cover Chapters 1-3, App A, B if you understand all slides, assignments,

More information

Operating Systems. VII. Synchronization

Operating Systems. VII. Synchronization Operating Systems VII. Synchronization Ludovic Apvrille ludovic.apvrille@telecom-paristech.fr Eurecom, office 470 http://soc.eurecom.fr/os/ @OS Eurecom Outline Synchronization issues 2/22 Fall 2017 Institut

More information

McBits: Fast code-based cryptography

McBits: Fast code-based cryptography McBits: Fast code-based cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands Joint work with Daniel Bernstein, Tung Chou December 17, 2013 IMA International Conference on Cryptography

More information

Pipeline no Prediction. Branch Delay Slots A. From before branch B. From branch target C. From fall through. Branch Prediction

Pipeline no Prediction. Branch Delay Slots A. From before branch B. From branch target C. From fall through. Branch Prediction Pipeline no Prediction Branching completes in 2 cycles We know the target address after the second stage? PC fetch Instruction Memory Decode Check the condition Calculate the branch target address PC+4

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 9

ECE 571 Advanced Microprocessor-Based Design Lecture 9 ECE 571 Advanced Microprocessor-Based Design Lecture 9 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 20 February 2018 Announcements HW#4 was posted. About branch predictors Don

More information

Unit 8: Sequ. ential Circuits

Unit 8: Sequ. ential Circuits CPSC 121: Models of Computation Unit 8: Sequ ential Circuits Based on slides by Patrice Be lleville and Steve Wolfman Pre-Class Learning Goals By the start of class, you s hould be able to Trace the operation

More information

CMSC 313 Lecture 16 Announcement: no office hours today. Good-bye Assembly Language Programming Overview of second half on Digital Logic DigSim Demo

CMSC 313 Lecture 16 Announcement: no office hours today. Good-bye Assembly Language Programming Overview of second half on Digital Logic DigSim Demo CMSC 33 Lecture 6 nnouncement: no office hours today. Good-bye ssembly Language Programming Overview of second half on Digital Logic DigSim Demo UMC, CMSC33, Richard Chang Good-bye ssembly

More information

DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD

DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD D DETERMINING THE VARIABLE QUANTUM TIME (VQT) IN ROUND ROBIN AND IT S IMPORTANCE OVER AVERAGE QUANTUM TIME METHOD Yashasvini Sharma 1 Abstract The process scheduling, is one of the most important tasks

More information