Computer Architecture ELEC2401 & ELEC3441


Lecture 8: Pipelining (3)
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering

Last Time: Pipeline Hazards
- Three types of hazard: structural hazard, data hazard, control hazard.
- On every cycle, the hardware needs to detect and resolve all types of hazards while keeping the pipeline as full as possible to achieve CPI = 1.
  - In real systems, CPI suffers slightly in return for a higher clock speed.
- The hardware must adhere to the ISA contract with the programmer: difficult, but worth it.

Control Hazard
- Control hazards occur as a result of branches and jumps: the next instruction is not necessarily at PC+4.
- Unconditional jumps: the next instruction is determined by the jump instruction.
- Conditional branches: the next instruction depends on the result of the branch comparison.
- Possible solutions:
  - Stall
  - Change ISA (forward)
  - Speculation
- Important questions to ask yourself:
  - When do we know the address of the next instruction to execute?
  - What happens to the instructions in the rest of the pipeline?
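The three cases above (fall-through, unconditional jump, conditional branch) can be sketched as a next-PC function. This is only an illustration: the name `next_pc` and the opcode labels "JUMP" and "BRANCH" are invented for this sketch, and it assumes 4-byte instructions with PC-relative targets.

```python
def next_pc(pc: int, opcode: str, imm: int = 0, taken: bool = False) -> int:
    """Return the address of the next instruction to execute.

    Assumptions (illustrative only): 4-byte instructions, PC-relative
    targets, and hypothetical opcode labels "JUMP" and "BRANCH".
    """
    if opcode == "JUMP":
        # Unconditional jump: the target is determined by the jump itself.
        return pc + imm
    if opcode == "BRANCH" and taken:
        # Conditional branch: the target depends on the comparison result.
        return pc + imm
    # Every other instruction (and a not-taken branch) falls through.
    return pc + 4


print(next_pc(96, "ADD"))                            # 100
print(next_pc(100, "BRANCH", imm=200, taken=True))   # 300
print(next_pc(100, "BRANCH", imm=200, taken=False))  # 104
```

The difficulty for a pipeline is that `taken` and the target are not available when the following instruction has to be fetched.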

Pipelining Branches
[Figure: five-stage pipeline datapath (F, D, E, M, W). Branch logic evaluates Bcomp and the branch target is calculated in the execute stage; PCSel selects the correct target depending on Bcomp.]
Challenge: the hardware does not know the target address until the EX stage.

Not-So-Good Solution: Stalling

time                   t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD          IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200          IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                    -    -    -    -    -
(I4) 108: ADD                         -    -    -    -    -
(I5) 300: SUB                              IF5  ID5  EX5  MA5  WB5

- Stalling: wait 2 cycles, then fetch the correct target once the address calculation is completed in the EX stage.
- Stalling doesn't quite work:
  - The hardware doesn't know that it is a branch instruction until the ID stage. What should happen at t2?
  - There is a huge performance penalty if the hardware always stalls 2 cycles regardless of the instruction: every instruction then takes 3 cycles, i.e. 3x the execution time.

Solution 1: Change ISA
- Expose the fact that there is a pipeline in the hardware.
- Change the ISA: the 2 instructions following a branch will ALWAYS be executed regardless of the branch comparison result.
- The extra cycle in which an instruction is always executed regardless of the comparison result is called a branch delay slot.
- The compiler may insert useful instructions in the branch delay slots, or NOPs.
  - e.g. an instruction that may be executed regardless of the branch target.

Branch Delay Slot Example

Original:
        addi x2, x1, 4
        lw   x4, 16(x2)
        beq  x1, x0, err
  ok:   add  x5, x3, x4
        ori  x6, x0, 23

  err:  sub  x5, x3, x4

Rearranged to fill the delay slots:
        beq  x1, x0, err
        addi x2, x1, 4
        lw   x4, 16(x2)
  ok:   add  x5, x3, x4
        ori  x6, x0, 23

  err:  sub  x5, x3, x4

- Instructions in the delay slots must not affect the branch decision, e.g. in the above, they cannot modify x1.
- Is the value of x4 ok?
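The rule that a delay-slot instruction must not affect the branch decision can be sketched as a simple register check. This is a minimal sketch, not a full compiler pass: the `Instr` tuple and the `safe_for_delay_slot` helper are hypothetical names, and the check covers only the one condition stated in the slide (the candidate must not write a register that the branch compares).

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical minimal instruction record used only for this sketch."""
    text: str
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def safe_for_delay_slot(candidate: Instr, branch: Instr) -> bool:
    """True if `candidate` does not write a register the branch compares.

    Writes to x0 are ignored because x0 is hard-wired to zero.
    """
    if candidate.dest is None or candidate.dest == "x0":
        return True
    return candidate.dest not in branch.srcs


beq = Instr("beq x1, x0, err", None, ("x1", "x0"))
addi = Instr("addi x2, x1, 4", "x2", ("x1",))
lw = Instr("lw x4, 16(x2)", "x4", ("x2",))
bad = Instr("addi x1, x1, 1", "x1", ("x1",))

print(safe_for_delay_slot(addi, beq))  # True
print(safe_for_delay_slot(lw, beq))    # True
print(safe_for_delay_slot(bad, beq))   # False: it modifies x1
```

A real scheduler would also have to confirm that moving the instruction preserves the data dependences among the other instructions.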

Real Processor: MIPS-I
- MIPS: Microprocessor without Interlocked Pipeline Stages.
- The first generation of MIPS processors has 1 delay slot defined.
- The branch decision is moved to the ID stage.
  - Only very simple branches are supported: beqz on 1 register.
- The compiler must find an instruction to fill the delay slot, or insert a NOP.

Solution 2: Speculate + Kill
- Step 1: Speculate that the instructions in the delay slots will be executed.
- Step 2: Determine at the EX stage:
  - if the branch is taken, kill the instructions in the IF and ID stages;
  - if the branch is not taken, do nothing.
- Pro: cycles are wasted only when the branch is taken.
- Cons:
  - complicates the hardware
  - interacts with stalls
  - branch/jump in delay slots?

Killing Instructions in IF, ID

Branch taken (instructions in the pipeline are killed):

time                   t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD          IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200          IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                    IF3  ID3  -    -    -
(I4) 108: ADD                         IF4  -    -    -    -
(I5) 300: SUB                              IF5  ID5  EX5  MA5  WB5

Branch not taken (instructions continue; I5 is the new instruction):

time                   t0   t1   t2   t3   t4   t5   t6   t7   t8
(I1) 096: ADD          IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQ +200          IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                    IF3  ID3  EX3  MA3  WB3
(I4) 108: ADD                         IF4  ID4  EX4  MA4  WB4
(I5) 112: SLL                              IF5  ID5  EX5  MA5  WB5

Killing Instructions
[Figure: pipeline datapath (F, D, E, M, W) with PCSel and kill signals applied to the instructions in the F and D stages. Branch logic evaluates Bcomp and the target is calculated in the execute stage; the correct target is selected depending on Bcomp.]
Note: the kill signal is not the same as the stall signal, because the instruction in ID is invalid.
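The cost of speculate-and-kill in the two timing diagrams above can be reproduced with a small cycle-count model. This is a rough sketch under stated assumptions: a 5-stage pipeline, branches resolved in EX so that a taken branch kills 2 instructions, and no other stalls. The function name `total_cycles` and the trace format are hypothetical.

```python
from typing import List, Tuple


def total_cycles(trace: List[Tuple[bool, bool]],
                 stages: int = 5,
                 taken_penalty: int = 2) -> int:
    """Cycles needed to run `trace` on an ideal pipeline with speculate+kill.

    `trace` lists the instructions that actually complete, in order, as
    (is_branch, taken) pairs. Killed instructions are not listed; they
    appear only as the penalty charged to each taken branch.
    """
    fill = stages - 1  # cycles before the first instruction reaches the last stage
    bubbles = sum(taken_penalty for is_branch, taken in trace
                  if is_branch and taken)
    return len(trace) + fill + bubbles


# Branch taken: ADD, BEQ (taken), SUB complete; I3 and I4 are killed.
taken_trace = [(False, False), (True, True), (False, False)]
# Branch not taken: ADD, BEQ, ADD, ADD, SLL all complete.
not_taken_trace = [(False, False), (True, False), (False, False),
                   (False, False), (False, False)]

print(total_cycles(taken_trace))      # 9 (t0..t8), but only 3 instructions done
print(total_cycles(not_taken_trace))  # 9 (t0..t8), with 5 instructions done
```

Both runs end at t8, but the taken case retires two fewer instructions, which is the 2-cycle penalty paid only when the speculation is wrong.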

Pipelining Jumps (JAL)
- Unconditional jumps can be implemented similarly to branches, with the branch condition always being true.
- JAL has the additional requirement of storing the return address (PC+4) in the destination register rd:
  - The instruction proceeds until the WB stage to write back the data to the register file.
  - Need to be careful with data forwarding and stalling on rd.
- Always kill the instructions after a JAL.

Pipelining JAL
[Figure: pipeline datapath (F, D, E, M, W) with PCSel and kill signals for the instructions in the F and D stages; brjmp target calculation and branch logic (Bcomp); PC+4 from the JAL instruction is saved for write-back.]

Interlock Control Logic
[Figure: pipeline datapath with the interlock control logic C_stall generating the stall signal; annotated "Forward from WB".]
Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

Interlock Control Logic (ignoring jumps & branches)
[Figure: pipeline datapath with control blocks C_stall (stall signal) and C_re (re1, re2).]
Should we always stall if an rs field matches some rd?
- Not every instruction writes a register: write enable, we.
- Not every instruction reads a register: read enable, re.

Source & Destination Registers

Instruction formats:
  ALU              rd | rs1 | rs2 | func10 | opcode
  ALUI/LW/JALR     rd | rs1 | imm[11:0] | func3 | opcode
  SW/Bcond         imm[11:7] | rs1 | rs2 | imm[6:0] | func3 | opcode
  Jump             offset[24:0] | opcode

  Instruction   Operation                            Source(s)   Destination
  ALU           rd ← rs1 func10 rs2                  rs1, rs2    rd
  ALUI          rd ← rs1 op imm                      rs1         rd
  LW            rd ← M[rs1 + imm]                    rs1         rd
  SW            M[rs1 + imm] ← rs2                   rs1, rs2    -
  Bcond         rs1, rs2: true: PC ← PC + imm,       rs1, rs2    -
                false: PC ← PC + 4
  J             PC ← PC + imm                        -           -
  JAL           x1 ← PC, PC ← PC + imm               -           x1
  JALR          rd ← PC, PC ← rs1 + imm              rs1         rd

Deriving the Stall Signal

  ws  = Case opcode
          JAL                               ⇒ x1
          else                              ⇒ rd
  we  = Case opcode
          ALU, ALUi, LW, JALR               ⇒ (ws ≠ 0)
          JAL                               ⇒ on
          ...                               ⇒ off
  C_re:
  re1 = Case opcode
          ALU, ALUi, LW, SW, Bcond, JALR    ⇒ on
          J, JAL                            ⇒ off
  re2 = Case opcode
          ALU, SW, Bcond                    ⇒ on
          ...                               ⇒ off
  C_stall:
  stall = ((rs1_D = ws_E).we_E + (rs1_D = ws_M).we_M + (rs1_D = ws_W).we_W).re1_D
        + ((rs2_D = ws_E).we_E + (rs2_D = ws_M).we_M + (rs2_D = ws_W).we_W).re2_D

The Bypass Signal: Deriving It from the Stall Signal

  stall = ((rs1_D = ws_E).we_E + (rs1_D = ws_M).we_M + (rs1_D = ws_W).we_W).re1_D
        + ((rs2_D = ws_E).we_E + (rs2_D = ws_M).we_M + (rs2_D = ws_W).we_W).re2_D

  ws = Case opcode
         JAL                                ⇒ x1
         else                               ⇒ rd
  we = Case opcode
         ALU, ALUi, LW, JALR                ⇒ (ws ≠ 0)
         JAL                                ⇒ on
         ...                                ⇒ off

  ASrc = (rs1_D = ws_E).we_E.re1_D

Is this correct? No, because only ALU and ALUi instructions can benefit from this bypass.
Split we_E into two components: we-bypass, we-stall.
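The stall derivation above can be written out as executable logic. The sketch below follows the case tables for ws, we, re1 and re2 as reconstructed from the slides (JAL writing x1, register 0 never causing a hazard); the `Instr` record, the opcode class strings, and the use of `None` for an empty pipeline stage are assumptions made for this illustration.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    """Hypothetical decoded instruction; opcode is a class name such as "ALU"."""
    opcode: str
    rd: int = 0
    rs1: int = 0
    rs2: int = 0


def ws(i: Instr) -> int:
    """Destination register: JAL implicitly writes x1, otherwise rd."""
    return 1 if i.opcode == "JAL" else i.rd


def we(i: Instr) -> bool:
    """Write enable: does this instruction write a register that matters?"""
    if i.opcode in {"ALU", "ALUi", "LW", "JALR"}:
        return ws(i) != 0
    return i.opcode == "JAL"


def re1(i: Instr) -> bool:
    """Read enable for rs1."""
    return i.opcode in {"ALU", "ALUi", "LW", "SW", "Bcond", "JALR"}


def re2(i: Instr) -> bool:
    """Read enable for rs2."""
    return i.opcode in {"ALU", "SW", "Bcond"}


def stall(d: Instr, e: Optional[Instr], m: Optional[Instr],
          w: Optional[Instr]) -> bool:
    """Stall decode if a source of `d` matches an uncommitted destination."""
    def hazard(src: int) -> bool:
        return any(s is not None and we(s) and src == ws(s)
                   for s in (e, m, w))
    return (re1(d) and hazard(d.rs1)) or (re2(d) and hazard(d.rs2))


add = Instr("ALU", rd=5, rs1=3, rs2=4)
print(stall(add, Instr("LW", rd=4, rs1=2), None, None))     # True: rs2 matches
print(stall(add, Instr("SW", rs1=2, rs2=4), None, None))    # False: SW writes nothing
print(stall(Instr("ALUi", rd=6, rs1=0),
            Instr("ALU", rd=0, rs1=1, rs2=2), None, None))  # False: ws is x0
```

Without bypassing, a match against any of the E, M or W stages forces a stall, which is what motivates splitting we_E in the bypass derivation.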

Bypass and Stall Signals
Split we_E into two components: we-bypass, we-stall.

  we-bypass_E = Case opcode_E
                  ALU, ALUi       ⇒ (ws ≠ 0)
                  ...             ⇒ off
  we-stall_E  = Case opcode_E
                  LW, JALR        ⇒ (ws ≠ 0)
                  JAL             ⇒ on
                  ...             ⇒ off

  ASrc  = (rs1_D = ws_E).we-bypass_E.re1_D
  stall = ((rs1_D = ws_E).we-stall_E + (rs1_D = ws_M).we_M + (rs1_D = ws_W).we_W).re1_D
        + ((rs2_D = ws_E).we_E + (rs2_D = ws_M).we_M + (rs2_D = ws_W).we_W).re2_D

Fully Bypassed Datapath
[Figure: fully bypassed datapath (D, E, M, W) with bypass multiplexers ASrc and BSrc in front of the ALU inputs, the stall signal, and PC for JAL, ...]
Is there still a need for the stall signal?

  stall = (rs1_D = ws_E).(opcode_E = LW_E).(ws_E ≠ 0).re1_D
        + (rs2_D = ws_E).(opcode_E = LW_E).(ws_E ≠ 0).re2_D

Resolving Hazards (3)
Strategy 3: Speculate on the dependence! Two cases:
- Guessed correctly: do nothing.
- Guessed incorrectly: kill and restart.
We'll later see examples of this approach in more complex processors.

Branch Delay Slots
- Post-1990s processors rarely have branch delay slots.
- Performance: an I-cache miss at a delay slot causes a significant performance penalty.
- Delay slots complicate advanced microarchitectures, e.g. superscalar processors that issue multiple instructions per cycle.
- It is difficult to find instructions to fill the delay slots of deeply pipelined processors.
  - Modern processors can have up to 30 pipeline stages.
- Other techniques help: branch prediction, predicated instructions, etc.
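With a fully bypassed datapath, the remaining stall condition is the load-use case given by the last stall equation above. The sketch below is one possible reading of that equation; the `Instr` record, the opcode class strings and the `load_use_stall` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical decoded instruction; opcode is a class name such as "LW"."""
    opcode: str
    rd: int = 0
    rs1: int = 0
    rs2: int = 0


# Opcode classes that read rs1 / rs2 (re1 and re2 in the slides).
READS_RS1 = {"ALU", "ALUi", "LW", "SW", "Bcond", "JALR"}
READS_RS2 = {"ALU", "SW", "Bcond"}


def load_use_stall(d: Instr, e: Instr) -> bool:
    """Stall only when the instruction in E is a load whose result `d` needs.

    A load's data is not available until after the memory stage, so it
    cannot be bypassed to the instruction immediately behind it.
    """
    if e.opcode != "LW" or e.rd == 0:
        return False
    uses_rs1 = d.opcode in READS_RS1 and d.rs1 == e.rd
    uses_rs2 = d.opcode in READS_RS2 and d.rs2 == e.rd
    return uses_rs1 or uses_rs2


lw = Instr("LW", rd=4, rs1=2)
add = Instr("ALU", rd=5, rs1=3, rs2=4)

print(load_use_stall(add, lw))                                # True: load-use
print(load_use_stall(add, Instr("ALU", rd=4, rs1=1, rs2=2)))  # False: bypassed
print(load_use_stall(add, Instr("LW", rd=0, rs1=2)))          # False: writes x0
```

ALU results in E, and all results in M and W, are assumed to reach the consumer through the bypass multiplexers, so they no longer appear in the stall condition.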

In Conclusion
- Control hazards are caused by branch and jump instructions.
  - The branch/jump destination is unknown until later stages.
- To address control hazards:
  - Stall
  - Expose branch delay slots to software
  - Do nothing (speculate branch not taken) and kill instructions if needed
- Additional considerations with PC and data forwarding.

Acknowledgements
- These slides contain material developed and copyrighted by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
- MIT material derived from course 6.823
- UCB material derived from course CS152, CS252