Computer Architecture
ELEC2401 & ELEC3441
Lecture 8: Pipelining (3)
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering

Last Time

Pipeline Hazard: Structural Hazard, Data Hazard, Control Hazard

- On every cycle, the hardware needs to detect and resolve all types of hazards, while keeping the pipeline as full as possible to achieve CPI = 1.
  - In real systems, CPI suffers slightly in return for higher clock speed.
- Need to make sure the hardware adheres to the ISA contract with the programmer.
  - Difficult, but worth it.

Control Hazard

- Control hazards occur as a result of branches and jumps.
  - The next instruction is not necessarily at PC+4.
- Unconditional jumps: the next instruction is determined by the jump instruction.
- Conditional branches: the next instruction depends on the result of the branch comparison.
- Possible solutions:
  - Stall
  - Change ISA (forward)
  - Speculation
- Important questions to ask yourself:
  - When do we know the address of the next instruction to execute?
  - What happens to the instructions in the rest of the pipeline?
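The way jumps and branches determine the next instruction can be sketched in a few lines of Python. This is only an illustrative model, not part of the original slides: the function name `next_pc` and the `kind` labels are hypothetical, and the addresses are borrowed from the BEQ example used later in the lecture.

```python
def next_pc(pc, kind, offset=0, taken=False):
    """Return the address of the next instruction to execute.

    `kind` is a hypothetical label for this sketch: "jump", "branch", or "other".
    """
    if kind == "jump":
        return pc + offset   # unconditional jump: target comes from the jump itself
    if kind == "branch" and taken:
        return pc + offset   # conditional branch: target used only if the comparison succeeds
    return pc + 4            # everything else falls through to the next instruction


print(next_pc(96, "other"))                       # 100
print(next_pc(100, "branch", 200, taken=True))    # 300
print(next_pc(100, "branch", 200, taken=False))   # 104
```

The hazard arises because `taken` and `offset` are only available several stages after the fetch that needs the result.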
Pipelining Branches

[Figure: five-stage pipeline (F, D, E, M, W). Br Logic evaluates Bcomp and the branch target is calculated; the PC select mux chooses the correct target depending on Bcomp (take branch?).]

Challenge: the hardware does not know the target address until the EX stage.

Not-So-Good Solution: Stalling

time                 t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  -    -    -    -    -
(I4) 108: ADD                       -    -    -    -    -
(I5) 300: SUB                            IF5  ID5  EX5  MA5  WB5

- Stalling: wait 2 cycles.
  - Fetch the correct target after the address calculation is completed in the EX stage.
- Stalling doesn't quite work:
  - The hardware doesn't know it is a branch instruction until the ID stage → what should happen at t2?
  - Huge performance penalty if the hardware always stalls 2 cycles regardless of the instruction → 3 cycles per instruction (3x execution time).

Solution 1: Change ISA

- Expose the fact that there is a pipeline in the hardware.
- Change the ISA: the 2 instructions following a branch will ALWAYS be executed, regardless of the branch comparison result.
- The extra cycle in which an instruction is always executed regardless of the comparison result is called a branch delay slot.
- The compiler may insert useful instructions in the branch delay slot, or NOPs.
  - e.g. an instruction that may be executed regardless of the branch target

Branch Delay Slot Example

Original:

        addi x2, x1, 4
        lw   x4, 16(x2)
        beq  x1, x0, err
    ok: add  x5, x3, x4
        ori  x6, x0, 23
        ...
   err: sub  x5, x3, x4

Rearranged (the two instructions after beq occupy the delay slots):

        beq  x1, x0, err
        addi x2, x1, 4
        lw   x4, 16(x2)
    ok: add  x5, x3, x4
        ori  x6, x0, 23
        ...
   err: sub  x5, x3, x4

- Instructions in the delay slots must not affect the branch decision.
  - e.g. in the above, they cannot modify x1.
- Is the value of x4 ok?
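The rule that delay-slot instructions must not affect the branch decision can be checked mechanically. The sketch below is a minimal, assumed model (the helper name `may_move_into_delay_slot` is hypothetical). It tests only this one necessary condition and is deliberately conservative. It does not answer the separate question of whether values such as x4 are correct on both paths.

```python
def may_move_into_delay_slot(dest_reg, branch_source_regs):
    """Return True if an instruction writing `dest_reg` leaves the branch decision intact.

    Conservative: any write to a register the branch compares is rejected.
    """
    return dest_reg not in branch_source_regs


# Source registers of: beq x1, x0, err
branch_sources = {"x1", "x0"}

print(may_move_into_delay_slot("x2", branch_sources))  # addi x2, x1, 4  -> True
print(may_move_into_delay_slot("x4", branch_sources))  # lw x4, 16(x2)   -> True
print(may_move_into_delay_slot("x1", branch_sources))  # a write to x1   -> False
```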
Real Processor: MIPS-I

- The first generation of MIPS processors has 1 delay slot defined.
- The branch decision is moved to the ID stage.
  - Only supports very simple branches: beqz on 1 register.
- The compiler must find an instruction to fill the delay slot, or put in a NOP.
- MIPS: Microprocessor without Interlocked Pipeline Stages

Solution 2: Speculate + Kill

- Step 1: Speculate that the instructions in the delay slots will be executed.
- Step 2: Determine at the EX stage:
  - if the branch is taken, kill the instructions in the IF and ID stages;
  - if the branch is not taken, do nothing.
- Pro: wastes cycles only when the branch is taken.
- Cons:
  - Complicates the hardware
  - Interacts with stall
  - Branch/jump in delay slots?

Killing Instructions in IF, ID

Branch taken (the instructions in the pipeline are killed):

time                 t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  IF3  ID3  -    -    -
(I4) 108: ADD                       IF4  -    -    -    -
(I5) 300: SUB                            IF5  ID5  EX5  MA5  WB5

Branch not taken (I3 and I4 continue; I5 is a new instruction fetched in sequence):

time                 t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  IF3  ID3  EX3  MA3  WB3
(I4) 108: ADD                       IF4  ID4  EX4  MA4  WB4
(I5) 112: SLL                            IF5  ID5  EX5  MA5  WB5

Killing Instructions

[Figure: five-stage pipeline (F, D, E, M, W) with kill signals applied to the instructions in the F and D stages. Br Logic evaluates Bcomp and the target is calculated; the PC select mux chooses the correct target depending on Bcomp (take branch?).]

Note: the kill signal takes priority over the stall signal, since the instruction in ID is invalid.
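The benefit of speculate-and-kill over always stalling can be estimated with a simple cycle count. The sketch below assumes the 2-cycle penalty shown in the timing diagrams and ignores pipeline fill and all other hazards. The function names and the instruction mix (100 instructions, 20 branches, 12 taken) are made up for illustration.

```python
def cycles_with_kill(n_instructions, branch_outcomes, penalty=2):
    """Speculate not-taken: pay `penalty` cycles only for branches that are taken."""
    killed = sum(penalty for taken in branch_outcomes if taken)
    return n_instructions + killed


def cycles_with_stall(n_instructions, branch_outcomes, penalty=2):
    """Stall on every branch: pay `penalty` cycles whether or not it is taken."""
    return n_instructions + penalty * len(branch_outcomes)


# Hypothetical mix: 100 instructions, 20 branches, 12 of them taken.
outcomes = [True] * 12 + [False] * 8

print(cycles_with_kill(100, outcomes) / 100)    # CPI = 1.24
print(cycles_with_stall(100, outcomes) / 100)   # CPI = 1.4
```

Under these assumptions, the saving comes entirely from the not-taken branches, which is the "pro" stated for this solution.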
Pipelining Jumps (JAL)

- Unconditional jumps can be implemented similarly to branches, with the branch condition being always true.
- JAL has the additional requirement of storing the return address (PC+4) in the destination register rd.
  - Proceed until the WB stage to write the data back to the register file.
  - Need to be careful with data forwarding and stalling on rd.
- Always kill the instructions after JAL.

Pipelining JAL

[Figure: five-stage pipeline (F, D, E, M, W) with kill signals on the F and D stages and a brjmp control input to the branch logic (Br Logic, Bcomp?, Calc target). PC+4 from the JAL instruction is saved and carried down the pipeline.]

Interlock Control Logic

[Figure: pipelined datapath with forwarding from WB and a stall signal generated by the control block C_stall.]

Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.
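Interlock Control Logic (ignoring jumps & branches)

[Figure: pipelined datapath with interlock control logic. C_stall generates the stall signal, and C_re generates re1 and re2 for the instruction in decode.]

- Should we always stall if an rs field matches some rd?
  - Not every instruction writes a register → we
  - Not every instruction reads a register → re

Source & Destination Registers

Instruction formats:

  ALU            rd | rs1 | rs2 | func10 | opcode
  ALUI/LW/JALR   rd | rs1 | imm[11:0] | func3 | opcode
  SW/Bcond       imm[11:7] | rs1 | rs2 | imm[6:0] | func3 | opcode
  Jump           offset[24:0] | opcode

  Instruction  Operation                        Source(s)   Destination
  ALU          rd ← rs1 func10 rs2              rs1, rs2    rd
  ALUI         rd ← rs1 op imm                  rs1         rd
  LW           rd ← M[rs1 + imm]                rs1         rd
  SW           M[rs1 + imm] ← rs2               rs1, rs2    -
  Bcond        compare rs1, rs2                 rs1, rs2    -
                 true:  PC ← PC + imm
                 false: PC ← PC + 4
  J            PC ← PC + imm                    -           -
  JAL          x1 ← PC + 4, PC ← PC + imm       -           x1
  JALR         rd ← PC + 4, PC ← rs1 + imm      rs1         rd

Deriving the Stall Signal

$$
\text{ws} = \begin{cases} \text{x1} & \text{opcode} = \text{JAL} \\ \text{rd} & \text{otherwise} \end{cases}
\qquad
\text{we} = \begin{cases} (\text{ws} \neq 0) & \text{opcode} \in \{\text{ALU}, \text{ALUi}, \text{LW}, \text{JALR}\} \\ \text{on} & \text{opcode} = \text{JAL} \\ \text{off} & \text{otherwise} \end{cases}
$$

C_re:

$$
\text{re1} = \begin{cases} \text{on} & \text{opcode} \in \{\text{ALU}, \text{ALUi}, \text{LW}, \text{SW}, \text{Bcond}, \text{JALR}\} \\ \text{off} & \text{opcode} \in \{\text{J}, \text{JAL}\} \end{cases}
\qquad
\text{re2} = \begin{cases} \text{on} & \text{opcode} \in \{\text{ALU}, \text{SW}, \text{Bcond}\} \\ \text{off} & \text{otherwise} \end{cases}
$$

C_stall:

$$
\begin{aligned}
\text{stall} ={} & \big((\text{rs1}_D = \text{ws}_E)\cdot\text{we}_E + (\text{rs1}_D = \text{ws}_M)\cdot\text{we}_M + (\text{rs1}_D = \text{ws}_W)\cdot\text{we}_W\big)\cdot\text{re1}_D \\
{}+{} & \big((\text{rs2}_D = \text{ws}_E)\cdot\text{we}_E + (\text{rs2}_D = \text{ws}_M)\cdot\text{we}_M + (\text{rs2}_D = \text{ws}_W)\cdot\text{we}_W\big)\cdot\text{re2}_D
\end{aligned}
$$

The Bypass Signal: Deriving It from the Stall Signal

Starting from the stall signal and the definitions of ws and we given above, a first attempt at the bypass select is:

$$
\text{ASrc} = (\text{rs1}_D = \text{ws}_E)\cdot\text{we}_E\cdot\text{re1}_D
$$

Is this correct? No, because only ALU and ALUi instructions can benefit from this bypass.

Split we_E into two components: we-bypass and we-stall.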
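The stall equation without bypassing, together with the ws, we, re1 and re2 case tables, translates fairly directly into code. The following Python sketch is one assumed encoding, not the lecture's implementation. Instructions are plain dictionaries with hypothetical keys `op`, `rd`, `rs1`, `rs2`, register numbers are integers, and `None` stands for a bubble in a stage.

```python
ALU, ALUI, LW, SW, BCOND, J, JAL, JALR = "ALU ALUI LW SW BCOND J JAL JALR".split()


def ws(inst):
    """Destination register: x1 for JAL, otherwise the rd field (0 if absent)."""
    return 1 if inst["op"] == JAL else inst.get("rd", 0)


def we(inst):
    """Write enable: writers of x0 are treated as non-writers; JAL always writes."""
    if inst["op"] in (ALU, ALUI, LW, JALR):
        return ws(inst) != 0
    return inst["op"] == JAL


def re1(inst):
    return inst["op"] in (ALU, ALUI, LW, SW, BCOND, JALR)


def re2(inst):
    return inst["op"] in (ALU, SW, BCOND)


def stall(d, e, m, w):
    """Stall decode if a source it reads matches a pending write in E, M, or W."""
    older = [i for i in (e, m, w) if i is not None]

    def hazard(src):
        return any(src == ws(i) and we(i) for i in older)

    return (re1(d) and hazard(d.get("rs1"))) or (re2(d) and hazard(d.get("rs2")))


add = {"op": ALU, "rd": 5, "rs1": 3, "rs2": 4}   # add x5, x3, x4
lw = {"op": LW, "rd": 4, "rs1": 2}               # lw  x4, 16(x2)
sw = {"op": SW, "rs1": 2, "rs2": 4}              # sw  x4, 0(x2)

print(stall(add, lw, None, None))   # True: x4 is still being produced by lw
print(stall(add, sw, None, None))   # False: sw writes no register
```

Keeping `we` and `re` as separate predicates mirrors the two observations on the slide: not every instruction writes a register, and not every instruction reads one.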
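Bypass and Stall Signals

Split we_E into two components: we-bypass and we-stall.

$$
\text{we-bypass}_E = \begin{cases} (\text{ws} \neq 0) & \text{opcode}_E \in \{\text{ALU}, \text{ALUi}\} \\ \text{off} & \text{otherwise} \end{cases}
\qquad
\text{we-stall}_E = \begin{cases} (\text{ws} \neq 0) & \text{opcode}_E \in \{\text{LW}, \text{JALR}\} \\ \text{on} & \text{opcode}_E = \text{JAL} \\ \text{off} & \text{otherwise} \end{cases}
$$

$$
\text{ASrc} = (\text{rs1}_D = \text{ws}_E)\cdot\text{we-bypass}_E\cdot\text{re1}_D
$$

$$
\begin{aligned}
\text{stall} ={} & \big((\text{rs1}_D = \text{ws}_E)\cdot\text{we-stall}_E + (\text{rs1}_D = \text{ws}_M)\cdot\text{we}_M + (\text{rs1}_D = \text{ws}_W)\cdot\text{we}_W\big)\cdot\text{re1}_D \\
{}+{} & \big((\text{rs2}_D = \text{ws}_E)\cdot\text{we}_E + (\text{rs2}_D = \text{ws}_M)\cdot\text{we}_M + (\text{rs2}_D = \text{ws}_W)\cdot\text{we}_W\big)\cdot\text{re2}_D
\end{aligned}
$$

Fully Bypassed Datapath

[Figure: fully bypassed datapath with ASrc and BSrc bypass muxes selecting operands from the E, M, and W stages; the PC is routed for JAL.]

Is there still a need for the stall signal?

$$
\begin{aligned}
\text{stall} ={} & (\text{rs1}_D = \text{ws}_E)\cdot(\text{opcode}_E = \text{LW}_E)\cdot(\text{ws}_E \neq 0)\cdot\text{re1}_D \\
{}+{} & (\text{rs2}_D = \text{ws}_E)\cdot(\text{opcode}_E = \text{LW}_E)\cdot(\text{ws}_E \neq 0)\cdot\text{re2}_D
\end{aligned}
$$

Resolving Data Hazards (3)

Strategy 3: Speculate on the dependence!

Two cases:
- Guessed correctly → do nothing.
- Guessed incorrectly → kill and restart.

We'll later see examples of this approach in more complex processors.

Branch Delay Slots

- Post-1990s processors rarely have branch delay slots.
- Performance: an I-cache miss at a delay slot causes a significant performance penalty.
- Delay slots complicate advanced microarchitectures.
  - e.g. superscalar processors with multiple instructions issued per cycle
- It is difficult to find instructions to fill the delay slots of deeply pipelined processors.
  - Modern processors can have up to 30 pipeline stages.
- Other techniques are helpful:
  - branch prediction, predicated instructions, etc.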
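Returning to the fully bypassed datapath: the only stall left in that equation is the load-use case, where the instruction in decode needs a value that a load in the E stage has not yet fetched. A minimal Python sketch of that condition follows. The dictionary keys and the function name `load_use_stall` are assumptions made for illustration, with `re1`/`re2` supplied as precomputed flags.

```python
def load_use_stall(d, e):
    """Stall decode only when E holds a load whose destination decode reads.

    d: dict with rs1, rs2 (register numbers) and re1, re2 (read-enable flags).
    e: dict with op and ws, or None for a bubble.
    """
    if e is None:
        return False
    load_in_e = e["op"] == "LW" and e["ws"] != 0
    match1 = d["re1"] and d["rs1"] == e["ws"]
    match2 = d["re2"] and d["rs2"] == e["ws"]
    return load_in_e and (match1 or match2)


add = {"rs1": 3, "rs2": 4, "re1": True, "re2": True}   # add x5, x3, x4

print(load_use_stall(add, {"op": "LW", "ws": 4}))    # True: load result not ready yet
print(load_use_stall(add, {"op": "ALU", "ws": 4}))   # False: ALU result can be bypassed
```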
In Conclusion

- Control hazards are caused by branch and jump instructions.
  - The branch/jump destination is unknown until later stages.
- To address control hazards:
  - Stall
  - Expose branch delay slots to software
  - Do nothing (speculate branch not taken) and kill instructions if needed
- Additional considerations apply to how these interact with stalling and data forwarding.

Acknowledgements

- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
- MIT material derived from course 6.823
- UCB material derived from course CS152, CS252