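6.823, L8-1
Simple Instruction Pipelining
Updated March 6, 2000
Laboratory for Computer Science, M.I.T.
http://www.csg.lcs.mit.edu/6.823

Pipelined Harvard Datapath (L8-2)

[Figure: Harvard datapath divided into five phases: fetch, decode & Reg-fetch, execute, memory, write-back]

Clock period can be reduced by dividing the execution of an instruction into multiple cycles:

    t_C > max {t_IM, t_RF, t_ALU, t_DM, t_RW} = t_DM (probably)

Here t_IM and t_DM are the instruction- and data-memory access times, t_RF and t_RW the register-file read and write times, and t_ALU the ALU delay.

However, CPI will increase unless instructions are pipelined.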
How to divide the datapath into stages (L8-3)

Suppose memory is significantly slower than other stages. In particular, suppose

    t_IM = t_DM = 10 units
    t_ALU = 5 units
    t_RF = t_RW = 1 unit

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance.

Minimizing Critical Path (L8-4)

[Figure: datapath with a fetch phase, a combined decode & Reg-fetch & execute phase, a memory phase and a write-back phase]

    t_C > max {t_IM, t_RF + t_ALU, t_DM, t_RW}

Write-back takes much less time than other stages. Suppose we combined it with the memory phase ⇒ increase the critical path by 10% (t_DM + t_RW = 11 units instead of 10).
Speedup by Pipelining, ignoring hazards (L8-5)

For the 4-stage pipeline, given
    t_IM = t_DM = 10 units, t_ALU = 5 units, t_RF = t_RW = 1 unit
t_C could be reduced from 27 units to 10 units ⇒ speedup = 2.7

However, if
    t_IM = t_DM = t_ALU = t_RF = t_RW = 5 units
the same 4-stage pipeline can reduce t_C from 25 units to 10 units ⇒ speedup = 2.5

But, since t_IM = t_DM = t_ALU = t_RF = t_RW, it is possible to achieve higher speedup with more stages in the pipeline.
A 5-stage pipeline can reduce t_C from 25 units to 5 units ⇒ speedup = 5

An Ideal Pipeline (L8-6)

[Figure: stage 1 → stage 2 → stage 3 → stage 4]

- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. An instruction pipeline, however, cannot satisfy the last condition. Why?
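The speedup figures on slide L8-5 can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming (as the slide does) that hazards and pipeline-register overhead are ignored and that the unpipelined cycle is the sum of all unit delays; the function names and the dictionary keys are illustrative, not from the lecture.

```python
# Unit delays in arbitrary time units, as given on the slides.
SLOW_MEMORY = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}
BALANCED = {"IM": 5, "RF": 5, "ALU": 5, "DM": 5, "RW": 5}

# Each stage is the list of units evaluated in series within one clock cycle.
FOUR_STAGE = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]    # decode merged with execute
FIVE_STAGE = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]


def clock_period(stages, delay):
    """The slowest stage sets the clock period."""
    return max(sum(delay[unit] for unit in stage) for stage in stages)


def speedup(stages, delay):
    """Unpipelined cycle (all units in series) divided by the pipelined cycle."""
    return sum(delay.values()) / clock_period(stages, delay)


print(speedup(FOUR_STAGE, SLOW_MEMORY))  # 27 / 10
print(speedup(FOUR_STAGE, BALANCED))     # 25 / 10
print(speedup(FIVE_STAGE, BALANCED))     # 25 / 5
```

Splitting decode from execute only pays off in the balanced case: with slow memory the 10-unit memory stages already bound the clock.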
How Instructions can Interact with each other in a pipeline (L8-7)

- An instruction in the pipeline may need a resource being used by another instruction in the pipeline ⇒ structural hazard
- An instruction may produce data that is needed by a later instruction ⇒ data hazard
- In the extreme case, an instruction may determine the next instruction to be executed ⇒ control hazard (branches, interrupts, ...)

Feedback to Resolve Hazards (L8-8)

[Figure: stages 1 to 4 in sequence, with feedback signals FB1, FB2, FB3, FB4 going back to earlier stages]

Controlling the pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur).

Feedback to previous stages is used to stall or kill instructions.
Technology Assumptions (L8-9)

We will assume
- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported Register files (slower!)

It makes the following timing assumption valid:

    t_IM ≈ t_RF ≈ t_ALU ≈ t_DM ≈ t_RW

A 5-stage pipelined Harvard architecture will be the focus of our detailed design.

5-Stage Pipelined Execution (L8-10)

Phases: fetch (IF), decode & Reg-fetch (ID), execute (EX), memory (MA), write-back (WB)

time           t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1   IF1  ID1  EX1  MA1  WB1
instruction2        IF2  ID2  EX2  MA2  WB2
instruction3             IF3  ID3  EX3  MA3  WB3
instruction4                  IF4  ID4  EX4  MA4  WB4
instruction5                       IF5  ID5  EX5  MA5  WB5
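The execution diagram on slide L8-10 follows one rule: with no hazards, instruction i enters the fetch phase in cycle i-1 and advances one phase per cycle. The sketch below regenerates the diagram from that rule. The function names are illustrative, and the abbreviations MA and WB for the memory and write-back phases are an assumption where the extracted slide text is incomplete.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def ideal_schedule(n_instructions):
    """Map instruction number -> {cycle: stage} for a hazard-free pipeline."""
    return {
        i: {(i - 1) + offset: stage for offset, stage in enumerate(STAGES)}
        for i in range(1, n_instructions + 1)
    }


def format_schedule(schedule):
    """Render the schedule as a plain-text table, one row per instruction."""
    last_cycle = max(max(row) for row in schedule.values())
    header = "time".ljust(15) + "".join(f"t{c}".ljust(5) for c in range(last_cycle + 1))
    lines = [header.rstrip()]
    for i, row in schedule.items():
        cells = [f"{row[c]}{i}" if c in row else "" for c in range(last_cycle + 1)]
        lines.append((f"instruction{i}".ljust(15) + "".join(cell.ljust(5) for cell in cells)).rstrip())
    return "\n".join(lines)


print(format_schedule(ideal_schedule(5)))
```

Once the pipeline is full (from t4 on), one instruction completes every cycle, which is why CPI stays at 1 in the ideal case.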
5-Stage Pipelined Execution: Resource Usage Diagram (L8-11)

Phases: fetch (IF), decode & Reg-fetch (ID), execute (EX), memory (MA), write-back (WB)

Resources
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   I2   I3   I4   I5
ID          I1   I2   I3   I4   I5
EX               I1   I2   I3   I4   I5
MA                    I1   I2   I3   I4   I5
WB                         I1   I2   I3   I4   I5

Pipelined Execution: ALU Instructions (L8-12)

[Figure: pipelined datapath for ALU instructions, annotated "not quite correct!"]
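Pipelined Execution: Need for Several IRs (L8-13)

[Figure: pipelined datapath with a separate instruction register (IR) for each stage]

IRs and Control points (L8-14)

[Figure: pipelined datapath showing the per-stage IRs and the control points they drive]

Are control points connected properly?
- Load/Store instructions
- ALU instructions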
Pipelined Harvard Datapath without interlocks and jumps (L8-15)

[Figure: pipelined datapath with control signals RegWrite, OpSel, MemWrite, RegDst, WBSrc, ExtSel, BSrc]

Hardwired Control Equations: Harvard Datapath, pipelined (L8-16)
Ignoring Jumps and Branches

ExtSel   = Case opcode_D
             ALUi, LW, SW, BEQZ, BNEZ  ⇒ sExt16
             ALUui                     ⇒ uExt16
             J, JAL                    ⇒ sExt26

BSrc     = Case opcode_D
             ALU                       ⇒ Reg
             ALUi, LW, SW              ⇒ Ext

OpSel    = Case opcode_E
             ALU                       ⇒ Func
             ALUi                      ⇒ Op
             LW, SW                    ⇒ +
             BEQZ, BNEZ                ⇒ 0?

MemWrite = Case opcode_M
             SW                        ⇒ on
             ...                       ⇒ off

WBSrc    = Case opcode_M
             ALU, ALUi                 ⇒ ALU
             LW                        ⇒ Mem
             JAL, JALR                 ⇒ PC

RegDst   = Case opcode_W
             ALU                       ⇒ rf3
             ALUi, LW                  ⇒ rf2
             JAL, JALR                 ⇒ R31

RegWrite = Case opcode_W
             ALU, ALUi, LW             ⇒ (ws ≠ 0)
             JAL, JALR                 ⇒ on
             ...                       ⇒ off
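The control equations on slide L8-16 are pure functions of the opcode held in each stage, so they translate directly into lookup code. This is a rough sketch rather than the lecture's implementation. The opcode classes are plain strings, `None` stands for "don't care", and the values `Ext` (for BSrc) and 31 (for RegDst on JAL/JALR) are assumptions where the extracted slide text is incomplete.

```python
def ext_sel(opcode_d):
    """Immediate-extension select, decoded from the opcode in stage D."""
    if opcode_d in ("ALUi", "LW", "SW", "BEQZ", "BNEZ"):
        return "sExt16"
    if opcode_d == "ALUui":
        return "uExt16"
    if opcode_d in ("J", "JAL"):
        return "sExt26"
    return None  # don't care


def b_src(opcode_d):
    """Second ALU operand: register or extended immediate (assumed name 'Ext')."""
    if opcode_d == "ALU":
        return "Reg"
    if opcode_d in ("ALUi", "LW", "SW"):
        return "Ext"
    return None


def op_sel(opcode_e):
    """ALU operation select, decoded from the opcode in stage E."""
    table = {"ALU": "Func", "ALUi": "Op", "LW": "+", "SW": "+",
             "BEQZ": "0?", "BNEZ": "0?"}
    return table.get(opcode_e)


def mem_write(opcode_m):
    """Data-memory write enable, decoded in stage M."""
    return opcode_m == "SW"


def wb_src(opcode_m):
    """Source of the value written back to the register file."""
    table = {"ALU": "ALU", "ALUi": "ALU", "LW": "Mem", "JAL": "PC", "JALR": "PC"}
    return table.get(opcode_m)


def reg_dst(opcode_w, rf2, rf3):
    """Destination register number (ws), decoded in stage W."""
    if opcode_w == "ALU":
        return rf3
    if opcode_w in ("ALUi", "LW"):
        return rf2
    if opcode_w in ("JAL", "JALR"):
        return 31  # link register, assumed to be R31
    return None


def reg_write(opcode_w, ws):
    """Register-file write enable; writes to register 0 are suppressed."""
    if opcode_w in ("ALU", "ALUi", "LW"):
        return ws != 0
    return opcode_w in ("JAL", "JALR")


# A load (LW rf2 <- M[(rf1) + imm]) seen by each stage in turn.
ws = reg_dst("LW", rf2=4, rf3=0)
print(ext_sel("LW"), b_src("LW"), op_sel("LW"), mem_write("LW"), wb_src("LW"), ws, reg_write("LW", ws))
```

Each function reads only the opcode of its own stage, which is the reason the pipelined design needs a copy of the instruction in every stage.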
Data Hazards (L8-17)

[Figure: pipelined datapath with stages labeled D, E, M, W]

    ...
    r1 ← (r0) + 10
    r4 ← (r1) + 17
    ...

Oops! The second instruction reads r1 before the first has written it.

Resolving Data Hazards (L8-18)

1. Freeze earlier pipeline stages until the data becomes available ⇒ interlocks
2. If data is available somewhere in the datapath, provide a bypass to get it to the right stage
Interlocks to resolve Data Hazards (L8-19)

[Figure: pipelined datapath with stages labeled D, E, M, W and a "Stall Condition" block feeding back to the earlier stages]

    ...
    r1 ← (r0) + 10
    r4 ← (r1) + 17
    ...

Stalled Stages and Pipeline Bubbles (L8-20)

time                     t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10      IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17           IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                               IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                   IF4  ID4  EX4  MA4  WB4
(I5)                                                        IF5  ID5  EX5  MA5  WB5

The repeated ID2 and IF3 entries are the stalled stages.

Resource Usage
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   I2   I3   I3   I3   I3   I4   I5
ID          I1   I2   I2   I2   I2   I3   I4   I5
EX               I1   nop  nop  nop  I2   I3   I4   I5
MA                    I1   nop  nop  nop  I2   I3   I4   I5
WB                         I1   nop  nop  nop  I2   I3   I4   I5

nop ⇒ pipeline bubble
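The stall diagram on slide L8-20 can be reproduced with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the lecture's logic: there is no bypassing, an instruction in decode stalls while any older instruction still in EX, MA or WB writes one of its source registers, and a stall injects a bubble (`None`) into EX. Instructions are hypothetical `(dest, sources)` tuples.

```python
def simulate(program):
    """Return one (IF, ID, EX, MA, WB) tuple of instruction indices per cycle."""
    pipe = [None] * 5          # occupancy of IF, ID, EX, MA, WB; None is a bubble
    fetched = 0
    trace = []
    while True:
        if pipe[0] is None and fetched < len(program):
            pipe[0] = fetched  # fetch the next instruction into IF
            fetched += 1
        if all(slot is None for slot in pipe):
            return trace       # pipeline drained
        trace.append(tuple(pipe))

        # Interlock: compare decode-stage sources with uncommitted destinations.
        stall = False
        if pipe[1] is not None:
            sources = program[pipe[1]][1]
            for older in pipe[2:]:
                if older is not None and program[older][0] in sources:
                    stall = True

        if stall:
            # IF and ID hold their instructions; a bubble enters EX.
            pipe = [pipe[0], pipe[1], None, pipe[2], pipe[3]]
        else:
            pipe = [None, pipe[0], pipe[1], pipe[2], pipe[3]]


program = [
    ("r1", ("r0",)),   # I1: r1 <- (r0) + 10
    ("r4", ("r1",)),   # I2: r4 <- (r1) + 17
    (None, ()),        # I3..I5: unrelated instructions
    (None, ()),
    (None, ()),
]
for cycle, occupancy in enumerate(simulate(program)):
    print(f"t{cycle}", occupancy)
```

In this model I2 (index 1) sits in ID from t2 through t5 and reaches EX at t6, the same three-cycle stall shown on the slide.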
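Interlock Control Logic: worksheet (L8-21)

[Figure: pipelined datapath with stages D, E, M, W; a C_stall block takes rf1 and rf2 from the decode stage and produces stall; the inputs from C_dest are marked "?"]

Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

Interlock Control Logic, ignoring jumps & branches (L8-22)

[Figure: C_stall takes rf1, rf2, re1, re2 from the decode stage (via C_re) and ws_E, we_E, ws_M, we_M, ws_W, we_W from the C_dest blocks of stages E, M, W, and produces stall]

Should we always stall if the rs field matches some rd?
- not every instruction writes registers ⇒ we
- not every instruction reads registers ⇒ re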
Source & Destination Registers (L8-23)

R-type:  op  rf1  rf2  rf3  func
I-type:  op  rf1  rf2  immediate16
J-type:  op  immediate26

                                              source(s)   destination
ALU    rf3 ← (rf1) func (rf2)                 rf1, rf2    rf3
ALUi   rf2 ← (rf1) op imm                     rf1         rf2
LW     rf2 ← M[(rf1) + imm]                   rf1         rf2
SW     M[(rf1) + imm] ← (rf2)                 rf1, rf2
BZ     cond (rf1)                             rf1
         true:  PC ← (PC) + imm
         false: PC ← (PC) + 4
J      PC ← (PC) + imm
JAL    r31 ← (PC), PC ← (PC) + imm                        31
JR     PC ← (rf1)                             rf1
JALR   r31 ← (PC), PC ← (rf1)                 rf1         31

Deriving the Stall Signal (L8-24)

C_dest:
  ws  = Case opcode
          ALU              ⇒ rf3
          ALUi, LW         ⇒ rf2
          JAL, JALR        ⇒ R31
  we  = Case opcode
          ALU, ALUi, LW    ⇒ (ws ≠ 0)
          JAL, JALR        ⇒ on
          ...              ⇒ off

C_re:
  re1 = Case opcode
          ALU, ALUi, ...   ⇒ on
          ...              ⇒ off
  re2 = Case opcode
          ...              ⇒ on
          ...              ⇒ off

stall = ?

Stall if the source registers of the instruction in the decode stage match the destination register of the uncommitted instructions.
The Stall Signal (L8-25)

C_dest:
  ws  = Case opcode
          ALU              ⇒ rf3
          ALUi, LW         ⇒ rf2
          JAL, JALR        ⇒ R31
  we  = Case opcode
          ALU, ALUi, LW    ⇒ (ws ≠ 0)
          JAL, JALR        ⇒ on
          ...              ⇒ off

C_re:
  re1 = Case opcode
          ALU, ALUi, LW, SW, BZ, JR, JALR   ⇒ on
          J, JAL                            ⇒ off
  re2 = Case opcode
          ALU, SW          ⇒ on
          ...              ⇒ off

C_stall:
  stall = ( (rf1_D = ws_E).we_E + (rf1_D = ws_M).we_M + (rf1_D = ws_W).we_W ).re1_D
        + ( (rf2_D = ws_E).we_E + (rf2_D = ws_M).we_M + (rf2_D = ws_W).we_W ).re2_D

This is not the full story!

Hazards due to Loads & Stores (L8-26)

[Figure: pipelined datapath with stages D, E, M, W and the Stall Condition block]

    ...
    M[(r1)+7] ← (r2)
    r4 ← M[(r3)+5]
    ...

Is there any possible data hazard in this instruction sequence?
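The stall equation on slide L8-25 reads as an OR over the three uncommitted stages, gated by the read enables of the decode-stage instruction. A minimal sketch follows, assuming register fields are small integers and that each of the stages E, M, W supplies a `(ws, we)` pair; the function and parameter names mirror the slide's signals but are otherwise illustrative.

```python
RE1_ON = {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}   # opcodes that read rf1
RE2_ON = {"ALU", "SW"}                                     # opcodes that read rf2


def stall(opcode_d, rf1_d, rf2_d, writers):
    """Evaluate the stall signal.

    writers holds one (ws, we) pair for each of stages E, M and W:
    ws is the destination register number, we the write enable.
    """
    writers = list(writers)
    re1_d = opcode_d in RE1_ON
    re2_d = opcode_d in RE2_ON
    match1 = any(we and rf1_d == ws for ws, we in writers)
    match2 = any(we and rf2_d == ws for ws, we in writers)
    return (match1 and re1_d) or (match2 and re2_d)


# r1 <- (r0) + 10 is in E (ws=1, we on); r4 <- (r1) + 17 is in decode.
# For an ALUi instruction rf1 is the source and rf2 is the destination field.
print(stall("ALUi", rf1_d=1, rf2_d=4, writers=[(1, True), (0, False), (0, False)]))
```

The read enables matter: an ALUi instruction whose rf2 field happens to equal a pending destination does not stall, because rf2 is its destination and is never read.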
Hazards due to Loads & Stores (L8-27)

Depends on the memory system?

[Figure: pipelined datapath with stages D, E, M, W]

    M[(r1)+7] ← (r2)
    r4 ← M[(r3)+5]

    (r1)+7 = (r3)+5 ⇒ data hazard

However, the hazard is avoided because our memory system completes writes in one cycle!

Complications due to Jumps (L8-28)

[Figure: fetch and decode stages with PC select signals PCSrc1 (j / ~j) and PCSrc2 (PCR / RInd, for register indirect jumps), a "Jump?" test on the decoded instruction, stall, and a kill signal to the fetched instruction]

    I1   096  ADD
    I2   100  J 200
    I3   104  ADD    (kill)
    I4   304  ADD

A jump instruction kills (not stalls) the following instruction. How?

Assuming no delay slot.
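Pipelining Jumps (L8-29)

[Figure: fetch and decode stages with PCSrc1 (j / ~j), PCSrc2 (PCR / RInd), stall, a "Jump?" test, and a mux controlled by IRSrc_D in front of the decode-stage IR that can kill the fetched instruction]

    I1   096  ADD
    I2   100  J 200
    I3   104  ADD    (kill)
    I4   304  ADD

No delay slot.

Killing the fetched instruction: insert a mux before IR.

IRSrc_D = Case opcode_D
            J, JAL   ⇒ nop
            ...      ⇒ IM

Any interaction between stall and jump?

Pipelining Conditional Branches (L8-30)

[Figure: the same fetch and decode logic with a "BEQZ?" test, a "zero?" test in the execute stage, PCSrc1 (j / ~j), PCSrc2 (PCR / RInd), stall and IRSrc_D]

    I1   096  ADD
    I2   100  BEQZ r1, 200
    I3   104  ADD
    I4   304  ADD

No delay slot.

Branch condition is not known until the execute stage ⇒ what action should be taken in the decode stage?

next lecture...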
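The kill mechanism on slide L8-29 can be traced on the slide's own four-instruction example. This is a sketch under stated assumptions: only J and JAL are handled, stalls and register-indirect jumps are ignored, the jump target is computed as the already-incremented PC plus the immediate (which gives 104 + 200 = 304, matching the slide), and the tuple encoding of instructions is hypothetical.

```python
NOP = ("nop",)


def ir_src(opcode_d):
    """Select for the mux in front of IR: a jump in decode kills the fetched instruction."""
    return "nop" if opcode_d in ("J", "JAL") else "IM"


def run(imem, pc, cycles):
    """Return (fetch address, opcode in decode) for each cycle."""
    ir = NOP
    log = []
    for _ in range(cycles):
        fetched = imem.get(pc, NOP)
        log.append((pc, ir[0]))
        if ir_src(ir[0]) == "nop":
            # Jump in decode: redirect the PC and replace the fetched instruction.
            pc, ir = pc + ir[1], NOP
        else:
            pc, ir = pc + 4, fetched
    return log


imem = {
    96: ("ADD",),
    100: ("J", 200),
    104: ("ADD",),    # fetched, then killed
    304: ("ADD",),
}
for fetch_pc, decoded in run(imem, pc=96, cycles=5):
    print(fetch_pc, decoded)
```

The instruction at address 104 is fetched but never reaches decode: a nop takes its place, which costs one cycle per taken jump when there is no delay slot.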