18-447 Lectre 12: Pipelined Implementations: Control Hazards and Resoltions S 09 L12-1 James C. Hoe Dept of ECE, CU arch 2, 2009 Annoncements: Spring break net week!! Project 2 de the week after spring break HW3 de onday after spring break (no more homework ntil week 12) Handots: Handot #10 Project 2 (On Blackboard) Terminology S 09 L12-2 Dependencies ordering reqirement between instrctions Pipeline Hazards: (potential) violations of dependencies Hazard Resoltion: static schedle instrctions at compile time to avoid hazards dynamic detect hazard and adjst pipeline operation Stall, Forward/byapss, anything else? Pipeline Interlock: hardware mechanisms for dynamic hazard resoltion detect and enforce dependences at rn time
Forwarding Paths (v1) S 09 L12-3 dist(i,j)=3 Registers internal forward? dist(i,j)=3 b. With forwarding ID/ Rs Rt Rt Rd ForwardB ForwardA ALU /E dist(i,j)=1 Forwarding nit Data memory /E.RegisterRd E/.RegisterRd E/ dist(i,j)=2 [Based on figres from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] S 09 L12-4 Data Hazard Analysis (with Forwarding) R/I- Type LW SW Br J Jr ID se prodce se se se E prodce (se) se Even with data-forwarding, RAW dependence on an immediate preceding LW instrction prodces a hazard
Why not very deep pipelines? S 09 L12-5 5-stage pipeline still has plenty of combinational delay between registers Sperpipelining increase pipelining sch that even intrinsic operations (e.g. ALU, RF access, memory access) reqire mltiple stages What s the problem? Inst 0 : r1 r2 + r3 Inst 1 : r4 r1 + 2 t 0 t 1 t 2 t 3 t 4 t 5 Inst 0 F a F F b D a D b E a E E b a b W a W b Inst 1 F a F b F D a D b DE a E ba EE ba ba W ba W ba W b F a F b D a FD b DE ab DE ab E ba E ab W b a W ab WW b D b Intel P4 s Sperpipelined Integer ALU S 09 L12-6 A lower B lower 16-bit add S lower A pper B pper 16-bit add S pper 1 2 32-bit addition pipelined over 2 stages, BW=1/latency 16-bit-add No stall between back-to-back dependencies
S 09 L12-7 What if yo really can t sperpipeline? inpt 0 otpt 0 inpt 1 otpt 1 2T delay If yo can t doble the bandwidth by pipelining, dobling the resorce also dobles the bandwidth Instrction Ordering/Dependencies S 09 L12-8 Data Dependence Tre dependence or Read after Write (RAW) Instrction ti mst wait for all reqired inpt operands Anti-Dependence or Write after Read (WAR) Later write mst not clobber a still-pending earlier read Otpt dependence or Write after Write (WAW) Earlier write mst not clobber an already-finished later write Control Dependence (or Procedral Dependence) all instrctions are dependent by control flow every instrction se and set the PC Control dependence is data dependence on the PC
PC Data Hazard Analysis S 09 L12-9 R/I- Type LW SW Br J Jr se se se se se se ID prodce prodce prodce prodce prodce prodce E PC hazard distance is at least 1 Does that mean we mst stall after every instrction? stage can t know which PC to fetch net ntil the crrent PC is fetched and decoded Control Hazard by Stalling S 09 L12-10 t 0 t 1 t 2 t 3 t 4 t 5 Inst h ID ALU E Inst i ID ALU E Inst j ID Inst k ALU Inst l For non-control-flow instrctions
Control Speclation for Dmmies S 09 L12-11 Rather than waiting for tre-dependence on PC to resolve, jst gess netpc = PC+4 to keep fetching every cycle Is this a good gess? What do yo lose if yo gessed incorrectly? Only ~20% of the instrction mi is control flow ~50 % of forward control flow (i.e., if-then-else) is taken ~90% of backward control flow (i.e., loop back) is taken Over all, typically ~70% taken and ~30% not taken [Lee and Smith, 1984] Epect netpc = PC+4 ~86% of the time, bt what abot the remaining 14%? Control Speclation: PC+4 S 09 L12-12 t 0 t 1 t 2 t 3 t 4 t 5 Inst h PC ID ALU E Inst i PC+4 ID ALU Inst j PC+8 ID Inst k Inst l Inst h is a branch target first opportnity to decode Inst h shold we correct now? Inst h branch condition and target evalated in ALU When a branch resolves - branch target (Inst k )is fetched - all instrctions fetched since inst h (so called wrong-path instrctions) mst be flshed
Pipeline Flsh on isprediction S 09 L12-13 t 0 t 1 t 2 t 3 t 4 t 5 Inst h PC ID ALU E Inst i PC+4 ID killed Inst j PC+8 killed Inst k target ID ALU Inst l ID ALU ID Inst h is a branch Pipeline Flsh on isprediction S 09 L12-14 t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 t 9 t 10 h i j k l m n ID h i bb k l m n h bb bb k l m n E h bb bb k l m n h bb bb k l m n branch resolved
Performance Impact S 09 L12-15 correct gess no penalty ~86% of the time incorrect gess 2 bbbles Assme no data hazards 20% control flow instrctions 70% of control flow instrctions are taken IPC = 1 / [ 1 + (0.20*0.7) * 2 ] = = 1 / [ 1 + 0.14 * 2 ] = 1 / 1.28 = 0.78 probability of a wrong gess penalty for a wrong gess Can we redce either of the two penalty terms? Redcing ispredict Penalty S 09 L12-16 PCSrc 0 1 PC prodce /ID Control ID/ PC prodce /E E/ Add 4 PC Address PC se Instrction memory Instrction Read register 1 RegWrite Read register 2 Registers Write register Write data Read data 1 Read data 2 Shift left 2 0 1 Add Add reslt ALUSrc Zero ALU ALU reslt Branch Address Write data emwrite Data memory Read data emtoreg 1 0 Instrction [15 0] 16 32 Sign etend 6 ALU control emread Instrction [20 16] Instrction [15 11] 0 1 RegDst ALUOp [Based on figres from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
IPS R2000 Control Flow Design S 09 L12-17 Simple address calclation based on instrction only Branch PC-offset: 16-bit fll-addition + 14-bit halfaddition Jmp PC-offset: concatenation only Simple branch condition based on RF One register relative (>, <, =) to 0 Eqality between 2 registers No addition/sbtraction necessary! An eplicit ISA design choice to enable branch resoltion in ID of a 5-stage pipeline Branch Resolved in ID S 09 L12-18.Flsh Hazard detection nit ID/ /E Control 0 E/ /ID PC 4 Instrction memory Shift left 2 Registers = ALU Data memory Sign etend Forwarding nit [Based on figres from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] IPC = 1 / [ 1 + (0.2*0.7) * 1 ] = 0.88
Branch Delay Slots?? Br r- L1 ID PC+4?? ID S 09 L12-19 L1?? if taken else PC+8 PC+4 is already in the pipeline throwing PC+4 away cost 1 bbble letting PC+4 finish to the end won t hrt performance R2000 defined branches to have an architectral latency of 1 instrction the instrction immediately after a branch is always eected branch target takes effect on the 2 nd instrction if delay slot can always do sefl work, effective IPC=1 withot BTB or even a pipeline flsh logic ~80% of delay slots can be filled atomatically by compilers Filling Delay Slots by Static Reordering Transformation reordering data independent (RAW, WAW, WAR) instrctions does not change program a. From before b. From target c. From fall throgh add $s1, $s2, $s3 sb $t4, $t5, $t6 add $s1, $s2, $s3 if $s2 = 0 then if $s1 = 0 then Becomes Delay slot add $s1, $s2, $s3 if $s1 = 0 then Becomes Delay slot Becomes Delay slot sb $t4, $t5, $t6 add $s1, $s2, $s3 S 09 L12-20 if $s2 = 0 then add $s1, $s2, $s3 add $s1, $s2, $s3 if $s1 = 0 then sb $t4, $t5, $t6 if $s1 = 0 then sb $t4, $t5, $t6 Safe? [Based on figres from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] within same basic block a new instrction added to nottaken path?? a new instrction added to taken??
Final Data Hazard Analysis S 09 L12-21 R/I- Type LW SW Br J Jr ID se se se prodce se se E prodce (se) With forwarding, hazard distance is 0 ecept for RAW dependence on LW where it is 1 Load delay slot semantics ensres a dependent instrction to be at least distance 2 Final PC Hazard Analysis S 09 L12-22 R/I- LW SW Br J Jr Type se se se (prodce) (prodce) (prodce) se se se ID E Hazard distance on a taken branch is 1 prodce prodce prodce Again, branch delay slot semantics ensres a dependent instrction to be at least distance 2 IPS R2000 can be interlock-free!! Hazard distance is greater in yor project, why?
aking a Better Gess S 09 L12-23 For ALU instrctions can t do better than gessing netpc=pc+4 still tricky since mst gess netpc before the crrent instrction is fetched For Branch/Jmp instrctions why not always gess in the taken direction since 70% correct again, mst gess netpc before the branch instrction is fetched (bt branch target is encoded in the instrction) st make a gess based only on the crrent fetch PC!!! Fortnately, - PC-offset branch/jmp target is static - We are allowed to be wrong some of the time The Locality Principle S 09 L12-24 One s recent past is a very good predictor of his/her near ftre. Temporal Locality: If yo jst did something, it is very likely that yo will do the same thing again soon since yo are here today, there is a good chance yo will be here again and again reglarly inverse is also tre Spatial a Locality: If yo jst did something, it is very likely yo will do something similar/related every time I find yo in this room, yo are probably sitting in the same seat yo are probably sitting near the same people
Branch Target Bffer (Oracle) S 09 L12-25 BTB (Oracle) a giant table indeed by PC retrns the gess for netpc PC BTB Instrction emory When encontering a PC for the first time, store in BTB PC + 4 if ALU/LD/ST PC+offset if Branch or Jmp?? if JR Effectively gessing branches are always taken IPC = 1 / [ 1 + (0.20*0.3) * 2 ] = 0.89