Pipelined Datapath

Lecture notes from MKP, H. H. Lee and S. Yalamanchili

Reading
- Sections 4.5 onward
- Practice Problems: 1, 3, 8, 12

Pipeline Performance
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps

Pipeline Performance
[Figure: single-cycle execution (Tc = 800ps) compared with pipelined execution (Tc = 200ps)]

Pipeline Speedup
- If all stages are balanced (i.e., all take the same time):
    Inter-instruction gap (pipelined) = Inter-instruction gap (nonpipelined) / Number of stages
- If not balanced, speedup is less
- Speedup is due to increased throughput
  - Latency (time for each instruction) does not decrease

Basic Idea
- All instructions are 32 bits
- Few and regular instruction formats
- Alignment of memory operands
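The speedup relation above can be checked numerically. The sketch below assumes the stage latencies used in these notes (100 ps for register read or write, 200 ps for the other stages) and a five-stage pipeline; the function names are made up for illustration.

```python
# Stage latencies in picoseconds (assumed: 100 ps register access, 200 ps otherwise).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_time(n_instr):
    # One long clock cycle per instruction, sized for the slowest instruction (lw).
    return n_instr * sum(STAGE_PS.values())

def pipelined_time(n_instr):
    # The clock period is set by the slowest stage. The first instruction needs
    # one cycle per stage; each later instruction finishes one cycle after it.
    cycle = max(STAGE_PS.values())
    return (len(STAGE_PS) + n_instr - 1) * cycle

for n in (3, 1_000_000):
    s, p = single_cycle_time(n), pipelined_time(n)
    print(n, s, p, round(s / p, 2))
```

For three instructions this gives 2400 ps against 1400 ps. For a long instruction stream the ratio approaches 800/200 = 4 rather than 5, because the stages are not balanced.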

Pipelining
- What makes it easy
  - All instructions are the same length
  - Simple instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction
- What really makes it hard:
  - Exception handling
  - Trying to improve performance with out-of-order execution, etc.

Pipeline Registers
- Need registers between stages
  - To hold information produced in previous cycle
[Figure: datapath with pipeline registers between stages, annotated with the pipeline stage execution time]

Graphically Representing Pipelines
[Figure: lw and add each pass through IF, ID, EX, MEM, WB, with add starting one stage after lw; horizontal axis is time]
- Shading indicates the unit is being used by the instruction
- Shading on the right half of the register file (ID or WB) or memory means the element is being read in that stage
- Shading on the left half means the element is being written in that stage

Graphically Representing Pipelines
[Figure: multiple-clock-cycle diagram of a lw followed by a sub over clock cycles CC 1 to CC 6; each instruction uses IM, Reg, ALU, DM, Reg; vertical axis is program execution order (in instructions), horizontal axis is time (in clock cycles)]
- Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
- Use this representation to help understand datapaths
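The two questions above (total cycles, and what a unit is doing in a given cycle) can be answered mechanically for a hazard-free instruction stream. This is a minimal sketch assuming an ideal five-stage pipeline with no stalls; `diagram` and `in_stage` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(instrs):
    # Instruction i (0-based) enters IF in clock cycle i + 1, so its row
    # is the stage list shifted right by i columns.
    total = len(instrs) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instrs):
        row = [""] * total
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        rows.append((name, row))
    return total, rows

def in_stage(instrs, stage, cycle):
    # Which instruction occupies `stage` during 1-based clock cycle `cycle`?
    i = cycle - 1 - STAGES.index(stage)
    return instrs[i] if 0 <= i < len(instrs) else None

program = ["lw", "sub", "add"]
total, rows = diagram(program)
print("cycles:", total)
for name, row in rows:
    print(f"{name:4}" + "".join(f"{c:5}" for c in row))
print("EX in cycle 4:", in_stage(program, "EX", 4))
```

With three instructions the program takes 7 cycles, and the ALU (EX stage) is working on the second instruction during cycle 4.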

Structural Hazard
[Figure: lw, add, sub, add each pass through IF, ID, EX, MEM, WB in consecutive cycles; horizontal axis is time]
- Need to separate instruction and data memory

Stage-by-stage datapath walk-through
[Figures: the pipelined datapath with the active stage highlighted, each annotated with the pipeline stage execution time]
- IF for Load, Store, ...
- ID for Load, Store, ...
- EX for Load
- MEM for Load
- WB for Load: wrong register number
- Corrected Datapath for Load
- EX for Store
- MEM for Store
- WB for Store

Pipelining Example
    add $14, $5, $6
    lw  $13, 24($1)
    add $12, $3, $4
    sub $11, $2, $3
    lw  $10, 20($1)
- Note what is happening in the register file
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB registers, annotated with the pipeline stage execution time]

Pipelined Control (Simplified)
[Figure]

Pipelined Control

              Execution/address calculation     Memory access               Write-back
              stage control lines               stage control lines         stage control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0          1         0
lw            0       0       0       1         0       1        0          1         1
sw            X       0       0       1         0       0        1          0         X
beq           X       0       1       0         1       0        0          0         X

- Control signals derived from instruction
  - As in single-cycle implementation
- Pass control signals along like data

Pipelined Control
[Figure]

Datapath with Control
[Figures: nine successive snapshots of the pipelined datapath with control, as lw, sub, and, or, add flow through the pipeline. The instruction in each stage per snapshot:]

Clock cycle  IF    ID    EX    MEM   WB
1            lw
2            sub   lw
3            and   sub   lw
4            or    and   sub   lw
5            add   or    and   sub   lw
6                  add   or    and   sub
7                        add   or    and
8                              add   or
9                                    add

Data Hazards (4.7)
- An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3

Dependencies
- Problem with starting next instruction before first is finished
  - Dependencies that go backward in time are data hazards
[Figure: multiple-clock-cycle diagram over CC 1 to CC 9 showing the value of register $2 in each cycle; horizontal axis is time (in clock cycles), vertical axis is program execution order (in instructions)]
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Software Solution
- Have compiler guarantee no hazards
- Where do we insert the nops?
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
- Problem: this really slows us down!

A Better Solution
- Consider this sequence:
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
- We can resolve hazards with forwarding
  - How do we detect when to forward?

Dependencies & Forwarding
- Do not wait for results to be written to the register file: find them in the pipeline!
  - Forward to ALU

Forwarding
[Figure: pipelined datapath with control, one instruction per stage: add in IF, or in ID, and in EX, sub in MEM, lw in WB]

Forwarding (simplified)
[Figure: ID/EX, EX/MEM and MEM/WB pipeline registers around the register file, ALU, data memory and a MUX]

Forwarding (from EX/MEM)
[Figure: as above, with MUXes at the ALU inputs]

Forwarding (from MEM/WB)
[Figure: as above, with MUXes at the ALU inputs]

Forwarding (operand selection)
[Figure: as above, with a Forwarding Unit driving the ALU input MUXes]

Forwarding (operand propagation)
[Figure: Rs, Rt and Rd are carried in the pipeline registers; the Forwarding Unit takes Rs and Rt from ID/EX, Rd from EX/MEM and Rd from MEM/WB]
- Combinational logic!

Detecting the Need to Forward
- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards when
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
      (forward from EX/MEM pipeline register)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
      (forward from MEM/WB pipeline register)

Detecting the Need to Forward
- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not $zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

Forwarding Paths
[Figure]

Forwarding Conditions
- EX hazard
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
- MEM hazard
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01

Double Data Hazard
- Consider the sequence:
    add $1, $1, $2
    add $1, $1, $3
    add $1, $1, $4
- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only forward if EX hazard condition isn't true

Revised Forwarding Condition
- MEM hazard (checking precedence of EX hazard)
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
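The forwarding conditions, including the precedence of the EX hazard over the MEM hazard, can be written as one small function per ALU operand. This is a sketch only: the dictionary layout and the function name are assumptions, and the select encoding (10 = take the value from EX/MEM, 01 = take it from MEM/WB, 00 = use the register file value) follows the usual textbook convention.

```python
def forward_select(ex_mem, mem_wb, src_reg):
    """Return the 2-bit forwarding mux select for one ALU operand.

    ex_mem, mem_wb: dicts with 'RegWrite' (bool) and 'Rd' (int), standing in
    for the fields held in the EX/MEM and MEM/WB pipeline registers.
    src_reg: ID/EX.RegisterRs (for ForwardA) or ID/EX.RegisterRt (for ForwardB).
    """
    # EX hazard: the instruction one ahead writes the register we read.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg:
        return 0b10
    # MEM hazard: only reached when the EX hazard test failed, which is
    # exactly the "not (EX hazard)" term of the revised condition.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg:
        return 0b01
    return 0b00

# Double data hazard: the third add reads $1 while both earlier adds write $1.
ex_mem = {"RegWrite": True, "Rd": 1}   # second add, most recent value of $1
mem_wb = {"RegWrite": True, "Rd": 1}   # first add, stale value of $1
print(format(forward_select(ex_mem, mem_wb, 1), "02b"))
```

Testing the EX hazard first is what makes the most recent result win when both pipeline registers hold a write to the same register.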

Datapath with Forwarding
[Figure]

Concurrent Execution
- Correct execution is about managing dependencies
  - Producer-consumer
  - Structural (using the same hardware component)
- We will come across other types of dependencies later!

Load-Use Data Hazard
[Figure: multiple-clock-cycle diagram of a load followed by a dependent instruction]
- Need to stall for one cycle

Forwarding
[Figure: pipelined datapath with control, one instruction per stage: add in IF, or in ID, and in EX, lw in MEM, lw in WB]

Load-Use Hazard Detection
- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when
  - ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))
- If detected, stall and insert bubble

Code Scheduling to Avoid Stalls
- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;

Original order (13 cycles):
           lw  $t1, 0($t0)
           lw  $t2, 4($t0)
  stall -> add $t3, $t1, $t2
           sw  $t3, 12($t0)
           lw  $t4, 8($t0)
  stall -> add $t5, $t1, $t4
           sw  $t5, 16($t0)

Reordered (11 cycles):
           lw  $t1, 0($t0)
           lw  $t2, 4($t0)
           lw  $t4, 8($t0)
           add $t3, $t1, $t2
           sw  $t3, 12($t0)
           add $t5, $t1, $t4
           sw  $t5, 16($t0)
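The load-use condition and the effect of reordering can be reproduced with a few lines of code. The sketch below assumes a five-stage pipeline with full forwarding, so the only stalls counted are one-cycle load-use stalls between adjacent instructions; the tuple encoding of instructions and the register names `t0` to `t5` are choices made here for illustration.

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    # The instruction in EX is a load whose destination (Rt) is a source
    # register of the instruction currently being decoded.
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

def cycles(program, stages=5):
    # program: list of (op, dest, src_a, src_b); unused fields are None.
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if load_use_hazard(prev[0] == "lw", prev[1], cur[2], cur[3]):
            stalls += 1
    # Fill the pipeline once, then one instruction per cycle, plus stalls.
    return stages + len(program) - 1 + stalls

original = [
    ("lw", "t1", "t0", None), ("lw", "t2", "t0", None),
    ("add", "t3", "t1", "t2"), ("sw", None, "t0", "t3"),
    ("lw", "t4", "t0", None), ("add", "t5", "t1", "t4"),
    ("sw", None, "t0", "t5"),
]
scheduled = [
    ("lw", "t1", "t0", None), ("lw", "t2", "t0", None),
    ("lw", "t4", "t0", None), ("add", "t3", "t1", "t2"),
    ("sw", None, "t0", "t3"), ("add", "t5", "t1", "t4"),
    ("sw", None, "t0", "t5"),
]
print(cycles(original), cycles(scheduled))
```

Moving the third load up separates each load from its first use, so both stalls disappear and the cycle count drops by two.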

How to Stall the Pipeline
- Force control values in ID/EX register to 0
  - EX, MEM and WB perform a nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
    - Can subsequently forward to EX stage

Stall/Bubble in the Pipeline
[Figure: multiple-clock-cycle diagram with the annotation "Stall inserted here"]

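Stall/Bubble in the Pipeline
[Figure: the same stall drawn more accurately]

Datapath with Hazard Detection
[Figure: datapath with forwarding and hazard detection units, annotated with the pipeline stage execution time]
- ALUSrc mux is missing!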

Control Hazards (4.8)
- Branch instruction determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline cannot always fetch correct instruction
    - Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and determine the branch condition

Branch Hazards
- If branch outcome determined in MEM
[Figure: the instructions fetched after the branch are flushed (control values set to 0) and the PC is redirected]

Reducing Branch Delay
- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
  - Add IF.Flush signal to squash IF/ID register
- Example: branch taken
    36: sub $10, $4, $8
    40: beq $1, $3, 72
    44: and $12, $2, $5
    48: or  $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
        ...
    72: lw  $4, 50($7)

Example: Branch Taken
[Figure]

Example: Branch Taken
[Figure]

Data Hazards for Branches
- If a comparison register is a destination of 2nd or 3rd preceding ALU instruction

    add $1, $2, $3       IF  ID  EX  MEM WB
    add $4, $5, $6           IF  ID  EX  MEM WB
    ...                          IF  ID  EX  MEM WB
    beq $1, $4, target               IF  ID  EX  MEM WB

  - Can resolve using forwarding

Data Hazards for Branches
- If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  - Need 1 stall cycle

    lw  $1, addr         IF  ID  EX  MEM WB
    add $4, $5, $6           IF  ID  EX  MEM WB
    beq stalled                  IF  ID
    beq $1, $4, target               ID  EX  MEM WB

Data Hazards for Branches
- If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles

    lw  $1, addr         IF  ID  EX  MEM WB
    beq stalled              IF  ID
    beq stalled                  ID
    beq $1, $0, target               ID  EX  MEM WB

Delay Slot (MIPS)
- Expose pipeline
- Load and jump/branch entail a delay slot
- The instruction right after the jump or branch is executed before the jump/branch

    jal function_a
    add $4, $5, $6     ; executed before jump
    lw  $2, 8($4)      ; executed after return

- Jump/branch and the delay slot instruction are considered indivisible
- In the delay slot, the compiler needs to schedule
  - A useful instruction (either from before the jump, or from after the jump without side effects)
  - Otherwise a NOP

Branch Prediction
- Longer pipelines cannot readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
[Figure: two pipeline diagrams, one with the prediction correct and one with the prediction incorrect]

1-Bit Predictor: Shortcoming
- Inner loop branches mispredicted twice!

    outer: ...
           ...
    inner: ...
           ...
           beq ..., ..., inner
           ...
           beq ..., ..., outer

  - Mispredict as taken on last iteration of inner loop
  - Then mispredict as not taken on first iteration of inner loop next time around

2-Bit Predictor: State Machine
[Figure: four-state prediction state machine]
- Only change prediction on two successive mispredictions

More-Realistic Branch Prediction
- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history
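The difference between the 1-bit and 2-bit schemes shows up clearly when an inner-loop branch is simulated. The sketch below uses a saturating counter, a common way to build such predictors; its transitions may differ in detail from the state machine drawn on the slide, and the starting state and loop shape (three taken iterations followed by one not-taken exit, repeated ten times) are arbitrary choices for illustration.

```python
def mispredictions(outcomes, bits):
    # Saturating counter with 2**bits states; predict "taken" whenever the
    # counter sits in the upper half of its range.
    top = (1 << bits) - 1
    state = 0          # start fully at "not taken" (arbitrary choice)
    misses = 0
    for taken in outcomes:
        predict_taken = state > top // 2
        if predict_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at the ends.
        state = min(top, state + 1) if taken else max(0, state - 1)
    return misses

# Inner-loop branch: taken three times, then falls through; run ten times.
inner_loop = [True, True, True, False] * 10
print(mispredictions(inner_loop, bits=1), mispredictions(inner_loop, bits=2))
```

The 1-bit predictor misses twice per pass through the inner loop (on exit and again on re-entry). After warming up, the 2-bit counter misses only on the exit, because a single not-taken outcome does not flip its prediction.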

AMD Bobcat
[Figure: AMD Bobcat block diagram, with regions annotated "ECE 6", "Instruction Level Parallelism (ILP)" and "Later in this course". Source: http://hothardware.com]

Intel Sandy Bridge
[Figure. Source: bdti.com]

Exceptions and Interrupts (4.9)
- Unexpected events requiring change in flow of control
  - Different ISAs use the terms differently
- Exception
  - Arises within the CPU
    - e.g., undefined opcode, overflow, syscall, ...
- Interrupt
  - From an external I/O controller
- Dealing with them without sacrificing performance is hard

Handling Exceptions
- In MIPS, exceptions managed by a System Control Coprocessor (CP0)
- Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
- Save indication of the problem
  - In MIPS: Cause register
  - We'll assume 1-bit
    - 0 for undefined opcode, 1 for overflow
- Jump to handler at 8000 0180

An Alternate Mechanism
- Vectored Interrupts
  - Handler address determined by the cause
- Example:
  - Undefined opcode: C000 0000
  - Overflow: C000 0020
  - ...: C000 0040
- Instructions either
  - Deal with the interrupt, or
  - Jump to real handler

Handler Actions
- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - Use EPC to return to program
- Otherwise
  - Terminate program
  - Report error using EPC, cause, ...

Exceptions in a Pipeline
- Another form of control hazard
- Consider overflow on add in EX stage
    add $1, $2, $1
  - Prevent $1 from being clobbered
  - Complete previous instructions
  - Flush add and subsequent instructions
  - Set Cause and EPC register values
  - Transfer control to handler
- Similar to mispredicted branch
  - Use much of the same hardware

Pipeline with Exceptions
[Figure]

Exception Properties
- Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    - Re-fetched and executed from scratch
- PC saved in EPC register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    - Handler must adjust

Exception Example
- Exception on add in
    40  sub $11, $2, $4
    44  and $12, $2, $5
    48  or  $13, $2, $6
    4C  add $1, $2, $1
    50  slt $15, $6, $7
    54  lw  $16, 50($7)
    ...
- Handler
    80000180  sw $25, 1000($0)
    80000184  sw $26, 1004($0)
    ...

Exception Example
[Figures: two successive datapath snapshots]

Multiple Exceptions
- Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
- Simple approach: deal with exception from earliest instruction
  - Flush subsequent instructions
  - Precise exceptions
- In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!

Imprecise Exceptions
- Just stop pipeline and save state
  - Including exception cause(s)
- Let the handler work out
  - Which instruction(s) had exceptions
  - Which to complete or flush
    - May require manual completion
- Simplifies hardware, but more complex handler software
- Not feasible for complex multiple-issue out-of-order pipelines

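Performance
- How do we assess the impact of stall cycles?
- How close do we approach the ideal of one instruction per cycle execution time?
- Back to the CPI model!

Recall: Program Execution Time

$$\text{ExecutionTime} = \left( \sum_{i=1}^{n} C_i \times CPI_i \right) \times \text{cycle\_time} \approx \text{Instruction\_count} \times CPI_{avg} \times \text{clock\_cycle\_time}$$

- n is the number of instruction classes
- The three factors are set by algorithms/compiler (instruction count), architecture (CPI) and technology (clock cycle time)

$$CPI_{avg} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} CPI_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}$$

- The ratio Instruction Count_i / Instruction Count is the relative frequency of class i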

Assessing Performance
- Ideal CPI is increased by dependencies
- Performance impact on CPI can be assessed by computing the impact on a per-instruction basis
    CPI = Base CPI + Probability_of_event * penalty_for_event
  - For example, an event may be a branch misprediction or the occurrence of a data hazard
  - The probability is computed for the occurrence of the event on an instruction
- Examples: pipelined processors

Instruction-Level Parallelism (ILP) (4.10)
- Pipelining: executing multiple instructions in parallel
- To increase ILP
  - Deeper pipeline
    - Less work per stage, so shorter clock cycle
  - Multiple issue
    - Replicate pipeline stages, giving multiple pipelines
    - Start multiple instructions per clock cycle
    - CPI < 1, so use Instructions Per Cycle (IPC)
    - E.g., 4GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
    - But dependencies reduce this in practice
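Both calculations on these slides are short enough to script: the per-event CPI adjustment and the peak throughput of a multiple-issue machine. In the sketch below the event probabilities and penalties are invented purely for illustration and are not taken from the notes; only the 4 GHz, 4-way figure comes from the slide.

```python
def effective_cpi(base_cpi, events):
    # events: iterable of (probability that an instruction triggers the event,
    #                      penalty in cycles when it does)
    return base_cpi + sum(prob * penalty for prob, penalty in events)

def peak_bips(clock_ghz, issue_width):
    # Peak billions of instructions per second if every issue slot is filled.
    return clock_ghz * issue_width

# Hypothetical workload: 20% branches, 10% of them mispredicted, 2-cycle penalty;
# 25% loads, 40% of them followed by a dependent instruction, 1-cycle stall.
events = [(0.20 * 0.10, 2), (0.25 * 0.40, 1)]
print(round(effective_cpi(1.0, events), 2))

# Slide example: 4 GHz, 4-way multiple issue.
print(peak_bips(4, 4), 1 / 4)
```

With these made-up numbers the CPI rises from 1.0 to 1.14. The second line reproduces the slide's peak figures of 16 BIPS and a peak CPI of 0.25.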

Multiple Issue
- Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into issue slots
  - Compiler detects and avoids hazards
- Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime

MIPS with Static Dual Issue
- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

Address  Instruction type  Pipeline stages
n        ALU/branch        IF  ID  EX  MEM WB
n + 4    Load/store        IF  ID  EX  MEM WB
n + 8    ALU/branch            IF  ID  EX  MEM WB
n + 12   Load/store            IF  ID  EX  MEM WB
n + 16   ALU/branch                IF  ID  EX  MEM WB
n + 20   Load/store                IF  ID  EX  MEM WB

MIPS with Static Dual Issue
[Figure: dual-issue datapath with separate ALU computation and address computation paths; an aligned instruction pair holds one ALU operation and one load/store]

Instruction Level Parallelism (ILP)
[Figure: pipeline with IF, ID, several parallel EX units, MEM, WB]
- Multiple instructions in EX at the same time
- Single (program) thread of execution
- Issue multiple instructions from the same instruction stream
- Average CPI < 1
- Often called out-of-order (OOO) cores

Dynamically Scheduled CPU
[Figure: dynamically scheduled pipeline, with these annotations]
- Preserves dependencies
- Hold pending operands
- Reorder buffer for register writes
- Can supply operands for issued instructions
- Results also sent to any waiting reservation stations

AMD Bulldozer
[Figure. Source: forum.beyond3d.com]

The P4 Microarchitecture
[Figure. From "The Microarchitecture of the Pentium 4 Processor", G. Hinton et al., Intel Technology Journal Q1, 2001]

Study Guide
- Given a code block and initial register values (those that are accessed), be able to determine the state of all pipeline registers at some future clock cycle
- Determine the size of each pipeline register
- Track pipeline state in the case of forwarding and branches
- Compute the number of cycles to execute a code block
- Modify the datapath to include forwarding and hazard detection for branches (this is trickier and time consuming, but well worth it)

Study Guide (cont.)
- Schedule code (manually) to improve performance, for example to eliminate hazards and fill delay slots
- Modify the datapath to add new instructions such as j
- Modify the datapath to accommodate a two-cycle data memory access, i.e., the data memory itself is a two-cycle pipeline
  - Modify the forwarding and hazard control logic
- Given a code sequence, be able to compute the number of stall cycles

Study Guide (cont.)
- Track the state of the 2-bit branch predictor over a sequence of branches in a code segment, for example a for-loop
- Show the pipeline state before and after an exception has taken place

Glossary
- Branch prediction
- Branch hazards
- Branch delay
- Control hazard
- Data hazard
- Delay slot
- Dynamic instruction issue
- Forwarding
- Imprecise exception
- Instruction scheduling
- Instruction level parallelism (ILP)
- Load-to-use hazard
- Pipeline bubbles
- Stall cycles
- Static instruction issue
- Structural hazard