Lecture 12: Pipelined Implementations: Control Hazards and Resolutions

Similar documents
Lecture 9: Control Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University

Review. Combined Datapath

Designing MIPS Processor

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 (2) Lecture notes from MKP, H. H. Lee and S.

Topics: A multiple cycle implementation. Distributed Notes

Designing Single-Cycle MIPS Processor

Instruction register. Data. Registers. Register # Memory data register

Computer Architecture Lecture 5: ISA Wrap-Up and Single-Cycle Microarchitectures

Pipeline Datapath. With some slides from: John Lazzaro and Dan Garcia

L07-L09 recap: Fundamental lesson(s)!

CSCI-564 Advanced Computer Architecture

CPU DESIGN The Single-Cycle Implementation

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

4. (3) What do we mean when we say something is an N-operand machine?

CPU DESIGN The Single-Cycle Implementation

Section 7.4: Integration of Rational Functions by Partial Fractions

CPSC 3300 Spring 2017 Exam 2

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle

Assignment Fall 2014

3. (2) What is the difference between fixed and hybrid instructions?

Concepts Introduced. Digital Electronics. Logic Blocks. Truth Tables

CMP N 301 Computer Architecture. Appendix C

Essentials of optimal control theory in ECON 4140

Pipelining. Traditional Execution. CS 365 Lecture 12 Prof. Yih Huang. add ld beq CS CS 365 2

Computer Architecture ELEC2401 & ELEC3441

Fast Path-Based Neural Branch Prediction

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)

[2] Predicting the direction of a branch is not enough. What else is necessary?

Reflections on a mismatched transmission line Reflections.doc (4/1/00) Introduction The transmission line equations are given by

Portland State University ECE 587/687. Branch Prediction

Project Two RISC Processor Implementation ECE 485

[2] Predicting the direction of a branch is not enough. What else is necessary?

Linear System Theory (Fall 2011): Homework 1. Solutions


ICS 233 Computer Architecture & Assembly Language

BLOOM S TAXONOMY. Following Bloom s Taxonomy to Assess Students

EE 660: Computer Architecture Out-of-Order Processors

Chapter 4 Supervised learning:

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Problem Class 4. More State Machines (Problem Sheet 3 con t)

Simulation investigation of the Z-source NPC inverter

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Pipeline Datapath. With some slides from: John Lazzaro and Dan Garcia

Department of Industrial Engineering Statistical Quality Control presented by Dr. Eng. Abed Schokry

Momentum Equation. Necessary because body is not made up of a fixed assembly of particles Its volume is the same however Imaginary

ENEE350 Lecture Notes-Weeks 14 and 15

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

1. Tractable and Intractable Computational Problems So far in the course we have seen many problems that have polynomial-time solutions; that is, on

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide)

/ : Computer Architecture and Design

Spiral 1 / Unit 3

3 2D Elastostatic Problems in Cartesian Coordinates

FRTN10 Exercise 12. Synthesis by Convex Optimization

Optimal Control of a Heterogeneous Two Server System with Consideration for Power and Performance

ANOVA INTERPRETING. It might be tempting to just look at the data and wing it

III. Demonstration of a seismometer response with amplitude and phase responses at:

PhysicsAndMathsTutor.com

Implementing the Controller. Harvard-Style Datapath for DLX

Compiling Techniques

Simplified Identification Scheme for Structures on a Flexible Base

CMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining

CSE Computer Architecture I

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

Lecture 3. (2) Last time: 3D space. The dot product. Dan Nichols January 30, 2018

Unit 6: Branch Prediction

PREDICTABILITY OF SOLID STATE ZENER REFERENCES

Fall 2011 Prof. Hyesoon Kim

Methods of Design-Oriented Analysis The GFT: A Final Solution for Feedback Systems

Survivable Virtual Topology Mapping To Provide Content Connectivity Against Double-Link Failures

1. State-Space Linear Systems 2. Block Diagrams 3. Exercises

Outline. Model Predictive Control: Current Status and Future Challenges. Separation of the control problem. Separation of the control problem

61C In the News. Processor Design: 5 steps

Decision Making in Complex Environments. Lecture 2 Ratings and Introduction to Analytic Network Process

1 Differential Equations for Solid Mechanics

Lecture Notes On THEORY OF COMPUTATION MODULE - 2 UNIT - 2

FRÉCHET KERNELS AND THE ADJOINT METHOD

Diffraction of light due to ultrasonic wave propagation in liquids

Lecture Notes: Finite Element Analysis, J.E. Akin, Rice University

Simple Instruction-Pipelining (cont.) Pipelining Jumps

Joint Transfer of Energy and Information in a Two-hop Relay Channel

On the circuit complexity of the standard and the Karatsuba methods of multiplying integers

6 PM Midnight A B C D. Time. T a s k. O r d e r. Computer Architecture CTKing/TTHwang. Pipelining-1. Pipelining-3 CTKing/TTHwang

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Technical Note. ODiSI-B Sensor Strain Gage Factor Uncertainty

Classify by number of ports and examine the possible structures that result. Using only one-port elements, no more than two elements can be assembled.

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

Worst-case analysis of the LPT algorithm for single processor scheduling with time restrictions

Higher Maths A1.3 Recurrence Relations - Revision

A Model-Free Adaptive Control of Pulsed GTAW

Binary addition example worked out

Setting The K Value And Polarization Mode Of The Delta Undulator

UNCERTAINTY FOCUSED STRENGTH ANALYSIS MODEL

Desert Mountain H. S. Math Department Summer Work Packet

Sources of Non Stationarity in the Semivariogram

COMPARATIVE STUDY OF ROBUST CONTROL TECHNIQUES FOR OTTO-CYCLE MOTOR CONTROL

ECE290 Fall 2012 Lecture 22. Dr. Zbigniew Kalbarczyk

Microprocessor Power Analysis by Labeled Simulation

Math 116 First Midterm October 14, 2009

Lesson 81: The Cross Product of Vectors

Transcription:

18-447 Lectre 12: Pipelined Implementations: Control Hazards and Resoltions S 09 L12-1 James C. Hoe Dept of ECE, CU arch 2, 2009 Annoncements: Spring break net week!! Project 2 de the week after spring break HW3 de onday after spring break (no more homework ntil week 12) Handots: Handot #10 Project 2 (On Blackboard) Terminology S 09 L12-2 Dependencies ordering reqirement between instrctions Pipeline Hazards: (potential) violations of dependencies Hazard Resoltion: static schedle instrctions at compile time to avoid hazards dynamic detect hazard and adjst pipeline operation Stall, Forward/byapss, anything else? Pipeline Interlock: hardware mechanisms for dynamic hazard resoltion detect and enforce dependences at rn time

Forwarding Paths (v1) S 09 L12-3 dist(i,j)=3 Registers internal forward? dist(i,j)=3 b. With forwarding ID/ Rs Rt Rt Rd ForwardB ForwardA ALU /E dist(i,j)=1 Forwarding nit Data memory /E.RegisterRd E/.RegisterRd E/ dist(i,j)=2 [Based on figres from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] S 09 L12-4 Data Hazard Analysis (with Forwarding) R/I- Type LW SW Br J Jr ID se prodce se se se E prodce (se) se Even with data-forwarding, RAW dependence on an immediate preceding LW instrction prodces a hazard

Why not very deep pipelines? S 09 L12-5 5-stage pipeline still has plenty of combinational delay between registers Sperpipelining increase pipelining sch that even intrinsic operations (e.g. ALU, RF access, memory access) reqire mltiple stages What s the problem? Inst 0 : r1 r2 + r3 Inst 1 : r4 r1 + 2 t 0 t 1 t 2 t 3 t 4 t 5 Inst 0 F a F F b D a D b E a E E b a b W a W b Inst 1 F a F b F D a D b DE a E ba EE ba ba W ba W ba W b F a F b D a FD b DE ab DE ab E ba E ab W b a W ab WW b D b Intel P4 s Sperpipelined Integer ALU S 09 L12-6 A lower B lower 16-bit add S lower A pper B pper 16-bit add S pper 1 2 32-bit addition pipelined over 2 stages, BW=1/latency 16-bit-add No stall between back-to-back dependencies

S 09 L12-7 What if yo really can t sperpipeline? inpt 0 otpt 0 inpt 1 otpt 1 2T delay If yo can t doble the bandwidth by pipelining, dobling the resorce also dobles the bandwidth Instrction Ordering/Dependencies S 09 L12-8 Data Dependence Tre dependence or Read after Write (RAW) Instrction ti mst wait for all reqired inpt operands Anti-Dependence or Write after Read (WAR) Later write mst not clobber a still-pending earlier read Otpt dependence or Write after Write (WAW) Earlier write mst not clobber an already-finished later write Control Dependence (or Procedral Dependence) all instrctions are dependent by control flow every instrction se and set the PC Control dependence is data dependence on the PC

PC Data Hazard Analysis S 09 L12-9 R/I- Type LW SW Br J Jr se se se se se se ID prodce prodce prodce prodce prodce prodce E PC hazard distance is at least 1 Does that mean we mst stall after every instrction? stage can t know which PC to fetch net ntil the crrent PC is fetched and decoded Control Hazard by Stalling S 09 L12-10 t 0 t 1 t 2 t 3 t 4 t 5 Inst h ID ALU E Inst i ID ALU E Inst j ID Inst k ALU Inst l For non-control-flow instrctions

Control Speclation for Dmmies S 09 L12-11 Rather than waiting for tre-dependence on PC to resolve, jst gess netpc = PC+4 to keep fetching every cycle Is this a good gess? What do yo lose if yo gessed incorrectly? Only ~20% of the instrction mi is control flow ~50 % of forward control flow (i.e., if-then-else) is taken ~90% of backward control flow (i.e., loop back) is taken Over all, typically ~70% taken and ~30% not taken [Lee and Smith, 1984] Epect netpc = PC+4 ~86% of the time, bt what abot the remaining 14%? Control Speclation: PC+4 S 09 L12-12 t 0 t 1 t 2 t 3 t 4 t 5 Inst h PC ID ALU E Inst i PC+4 ID ALU Inst j PC+8 ID Inst k Inst l Inst h is a branch target first opportnity to decode Inst h shold we correct now? Inst h branch condition and target evalated in ALU When a branch resolves - branch target (Inst k )is fetched - all instrctions fetched since inst h (so called wrong-path instrctions) mst be flshed

Pipeline Flsh on isprediction S 09 L12-13 t 0 t 1 t 2 t 3 t 4 t 5 Inst h PC ID ALU E Inst i PC+4 ID killed Inst j PC+8 killed Inst k target ID ALU Inst l ID ALU ID Inst h is a branch Pipeline Flsh on isprediction S 09 L12-14 t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 t 9 t 10 h i j k l m n ID h i bb k l m n h bb bb k l m n E h bb bb k l m n h bb bb k l m n branch resolved

Performance Impact S 09 L12-15 correct gess no penalty ~86% of the time incorrect gess 2 bbbles Assme no data hazards 20% control flow instrctions 70% of control flow instrctions are taken IPC = 1 / [ 1 + (0.20*0.7) * 2 ] = = 1 / [ 1 + 0.14 * 2 ] = 1 / 1.28 = 0.78 probability of a wrong gess penalty for a wrong gess Can we redce either of the two penalty terms? Redcing ispredict Penalty S 09 L12-16 PCSrc 0 1 PC prodce /ID Control ID/ PC prodce /E E/ Add 4 PC Address PC se Instrction memory Instrction Read register 1 RegWrite Read register 2 Registers Write register Write data Read data 1 Read data 2 Shift left 2 0 1 Add Add reslt ALUSrc Zero ALU ALU reslt Branch Address Write data emwrite Data memory Read data emtoreg 1 0 Instrction [15 0] 16 32 Sign etend 6 ALU control emread Instrction [20 16] Instrction [15 11] 0 1 RegDst ALUOp [Based on figres from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

IPS R2000 Control Flow Design S 09 L12-17 Simple address calclation based on instrction only Branch PC-offset: 16-bit fll-addition + 14-bit halfaddition Jmp PC-offset: concatenation only Simple branch condition based on RF One register relative (>, <, =) to 0 Eqality between 2 registers No addition/sbtraction necessary! An eplicit ISA design choice to enable branch resoltion in ID of a 5-stage pipeline Branch Resolved in ID S 09 L12-18.Flsh Hazard detection nit ID/ /E Control 0 E/ /ID PC 4 Instrction memory Shift left 2 Registers = ALU Data memory Sign etend Forwarding nit [Based on figres from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] IPC = 1 / [ 1 + (0.2*0.7) * 1 ] = 0.88

Branch Delay Slots?? Br r- L1 ID PC+4?? ID S 09 L12-19 L1?? if taken else PC+8 PC+4 is already in the pipeline throwing PC+4 away cost 1 bbble letting PC+4 finish to the end won t hrt performance R2000 defined branches to have an architectral latency of 1 instrction the instrction immediately after a branch is always eected branch target takes effect on the 2 nd instrction if delay slot can always do sefl work, effective IPC=1 withot BTB or even a pipeline flsh logic ~80% of delay slots can be filled atomatically by compilers Filling Delay Slots by Static Reordering Transformation reordering data independent (RAW, WAW, WAR) instrctions does not change program a. From before b. From target c. From fall throgh add $s1, $s2, $s3 sb $t4, $t5, $t6 add $s1, $s2, $s3 if $s2 = 0 then if $s1 = 0 then Becomes Delay slot add $s1, $s2, $s3 if $s1 = 0 then Becomes Delay slot Becomes Delay slot sb $t4, $t5, $t6 add $s1, $s2, $s3 S 09 L12-20 if $s2 = 0 then add $s1, $s2, $s3 add $s1, $s2, $s3 if $s1 = 0 then sb $t4, $t5, $t6 if $s1 = 0 then sb $t4, $t5, $t6 Safe? [Based on figres from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] within same basic block a new instrction added to nottaken path?? a new instrction added to taken??

Final Data Hazard Analysis S 09 L12-21 R/I- Type LW SW Br J Jr ID se se se prodce se se E prodce (se) With forwarding, hazard distance is 0 ecept for RAW dependence on LW where it is 1 Load delay slot semantics ensres a dependent instrction to be at least distance 2 Final PC Hazard Analysis S 09 L12-22 R/I- LW SW Br J Jr Type se se se (prodce) (prodce) (prodce) se se se ID E Hazard distance on a taken branch is 1 prodce prodce prodce Again, branch delay slot semantics ensres a dependent instrction to be at least distance 2 IPS R2000 can be interlock-free!! Hazard distance is greater in yor project, why?

aking a Better Gess S 09 L12-23 For ALU instrctions can t do better than gessing netpc=pc+4 still tricky since mst gess netpc before the crrent instrction is fetched For Branch/Jmp instrctions why not always gess in the taken direction since 70% correct again, mst gess netpc before the branch instrction is fetched (bt branch target is encoded in the instrction) st make a gess based only on the crrent fetch PC!!! Fortnately, - PC-offset branch/jmp target is static - We are allowed to be wrong some of the time The Locality Principle S 09 L12-24 One s recent past is a very good predictor of his/her near ftre. Temporal Locality: If yo jst did something, it is very likely that yo will do the same thing again soon since yo are here today, there is a good chance yo will be here again and again reglarly inverse is also tre Spatial a Locality: If yo jst did something, it is very likely yo will do something similar/related every time I find yo in this room, yo are probably sitting in the same seat yo are probably sitting near the same people

Branch Target Bffer (Oracle) S 09 L12-25 BTB (Oracle) a giant table indeed by PC retrns the gess for netpc PC BTB Instrction emory When encontering a PC for the first time, store in BTB PC + 4 if ALU/LD/ST PC+offset if Branch or Jmp?? if JR Effectively gessing branches are always taken IPC = 1 / [ 1 + (0.20*0.3) * 2 ] = 0.89