Simple Instruction-Pipelining. Pipelined Harvard Datapath


6.823, L8
Simple Instruction Pipelining
Updated March 6, 2000
Laboratory for Computer Science, M.I.T.
http://www.csg.lcs.mit.edu/6.823

Pipelined Harvard Datapath

[Figure: datapath divided into fetch, decode & Reg-fetch, execute, memory, and write-back phases]

Clock period can be reduced by dividing the execution of an instruction into multiple cycles:

    t_C > max{t_IM, t_RF, t_ALU, t_DM, t_RW} = t_DM (probably)

However, CPI will increase unless instructions are pipelined.

How to divide the datapath into stages

Suppose memory is significantly slower than other stages. In particular, suppose

    t_IM = t_DM = 10 units
    t_ALU = 5 units
    t_RF = t_RW = 1 unit

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance.

Minimizing Critical Path

[Figure: datapath with phases fetch; decode & Reg-fetch & execute; memory; write-back]

    t_C > max{t_IM, t_RF + t_ALU, t_DM, t_RW}

Write-back takes much less time than the other stages. Suppose we combined it with the memory phase: this would increase the critical path by 10%.
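As a worked check of the 10% figure, here is the arithmetic under the delays assumed on the slide (memory 10 units, ALU 5 units, register read and write-back 1 unit each). The subscripts RF, ALU, and RW are the names used in this transcription for the register-fetch, ALU, and register-write delays.

```latex
\begin{align*}
\text{separate write-back:}\quad
t_C &> \max\{t_{IM},\; t_{RF}+t_{ALU},\; t_{DM},\; t_{RW}\}
     = \max\{10,\; 6,\; 10,\; 1\} = 10 \\
\text{write-back merged into memory:}\quad
t_C &> \max\{t_{IM},\; t_{RF}+t_{ALU},\; t_{DM}+t_{RW}\}
     = \max\{10,\; 6,\; 11\} = 11 \\
\text{increase in critical path:}\quad
\frac{11-10}{10} &= 10\%
\end{align*}
```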

Speedup by Pipelining (ignoring hazards)

For the 4-stage pipeline, given t_IM = t_DM = 10 units, t_ALU = 5 units, t_RF = t_RW = 1 unit:
t_C could be reduced from 27 units to 10 units ⇒ speedup = 2.7

However, if t_IM = t_DM = t_ALU = t_RF = t_RW = 5 units:
the same 4-stage pipeline can reduce t_C from 25 units to 10 units ⇒ speedup = 2.5

But, since t_IM = t_DM = t_ALU = t_RF = t_RW, it is possible to achieve higher speedup with more stages in the pipeline:
a 5-stage pipeline can reduce t_C from 25 units to 5 units ⇒ speedup = 5

An Ideal Pipeline

[Figure: stage 1, stage 2, stage 3, stage 4 in sequence]

- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. An instruction pipeline, however, cannot satisfy the last condition. Why?
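The speedup figures on the "Speedup by Pipelining" slide can be reproduced with a short calculation. The sketch below is illustrative only: the function names and the unit labels IM, RF, ALU, DM, RW are assumptions made for this example, and hazards are ignored, as on the slide.

```python
def clock_period(stage_groups, delays):
    """The slowest stage sets the clock; a stage's delay is the sum of the units merged into it."""
    return max(sum(delays[unit] for unit in group) for group in stage_groups)


def speedup(stage_groups, delays):
    """Unpipelined cycle time (sum of all unit delays) divided by the pipelined clock period."""
    unpipelined = sum(delays.values())
    return unpipelined / clock_period(stage_groups, delays)


# Stage groupings (hypothetical labels): the 4-stage pipeline merges RF and ALU.
FOUR_STAGE = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]
FIVE_STAGE = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]

slow_memory = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}
balanced = {unit: 5 for unit in slow_memory}

print(speedup(FOUR_STAGE, slow_memory))  # 27 / 10
print(speedup(FOUR_STAGE, balanced))     # 25 / 10
print(speedup(FIVE_STAGE, balanced))     # 25 / 5
```

With balanced delays, every extra stage that splits the slowest group shortens the clock, which is why the 5-stage split does better than the 4-stage one.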

How Instructions can Interact with each other in a pipeline

- An instruction in the pipeline may need a resource being used by another instruction in the pipeline ⇒ structural hazard
- An instruction may produce data that is needed by a later instruction ⇒ data hazard
- In the extreme case, an instruction may determine the next instruction to be executed ⇒ control hazard (branches, interrupts, ...)

Feedback to Resolve Hazards

[Figure: stages 1 to 4 in sequence, each with a feedback path (FB1 to FB4) to the preceding stages]

Controlling the pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur).

Feedback to previous stages is used to stall or kill instructions.

Technology Assumptions

We will assume:
- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported register files (slower!)

It makes the following timing assumption valid:

    t_IM ≈ t_RF ≈ t_ALU ≈ t_DM ≈ t_RW

A 5-stage pipelined Harvard architecture will be the focus of our detailed design.

5-Stage Pipelined Execution

Stages: fetch (IF), decode & Reg-fetch (ID), execute (EX), memory (MA), write-back (WB)

time          t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
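The timing table above follows a simple rule: with no hazards, instruction i enters IF one cycle after instruction i-1 and then advances one stage per cycle. A minimal sketch that generates the same diagram is shown here; the function names are hypothetical, and MA and WB are the stage abbreviations used in this transcription.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def ideal_schedule(n_instructions):
    """Return {instruction: {cycle: stage}} for a hazard-free pipeline.

    Instruction i (1-based) is fetched in cycle i-1 and moves one stage per cycle.
    """
    return {
        i: {i - 1 + offset: stage for offset, stage in enumerate(STAGES)}
        for i in range(1, n_instructions + 1)
    }


def render(schedule):
    """Format the schedule as a text table with one row per instruction."""
    last_cycle = max(max(row) for row in schedule.values())
    lines = ["      " + " ".join(f"t{c:<3}" for c in range(last_cycle + 1))]
    for i, row in schedule.items():
        cells = [f"{row[c] + str(i):<4}" if c in row else "    "
                 for c in range(last_cycle + 1)]
        lines.append(f"I{i:<4} " + " ".join(cells).rstrip())
    return "\n".join(lines)


print(render(ideal_schedule(5)))
```

Reading the result column by column instead of row by row gives the resource-usage view: in cycle t4 all five stages are busy, each with a different instruction.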

5-Stage Pipelined Execution: Resource Usage Diagram

Stages: fetch (IF), decode & Reg-fetch (ID), execute (EX), memory (MA), write-back (WB)

Resources
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   I4   I5
EX              I1   I2   I3   I4   I5
MA                   I1   I2   I3   I4   I5
WB                        I1   I2   I3   I4   I5

Pipelined Execution: ALU Instructions

[Figure: pipelined datapath for ALU instructions]

Not quite correct!

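Pipelined Execution: Need for Several IRs

[Figure: pipelined datapath with an instruction register (IR) per stage]

IRs and Control Points

[Figure: pipelined datapath with IRs and control points]

Are control points connected properly?
- Load/Store instructions
- ALU instructions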

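Pipelined Harvard Datapath without interlocks and jumps

[Figure: pipelined datapath with control signals RegWrite, OpSel, MemWrite, RegDst, WBSrc, ExtSel, BSrc]

Hardwired Control Equations: Harvard Datapath, pipelined

(Ignoring jumps and branches. The subscript on opcode names the stage whose instruction is decoded: D, E, M, or W.)

ExtSel = Case opcode_D
    ALUi, LW, SW, BEQZ, BNEZ  ⇒ sExt16
    ALUui                     ⇒ uExt16
    J, JAL                    ⇒ sExt26

BSrc = Case opcode_D
    ALU                       ⇒ Reg
    ALUi, LW, SW              ⇒ Imm

OpSel = Case opcode_E
    ALU                       ⇒ Func
    ALUi                      ⇒ Op
    LW, SW                    ⇒ +
    BEQZ, BNEZ                ⇒ 0?

MemWrite = Case opcode_M
    SW                        ⇒ on
    ...                       ⇒ off

WBSrc = Case opcode_M
    ALU, ALUi                 ⇒ ALU
    LW                        ⇒ Mem
    JAL, JALR                 ⇒ PC

RegDst = Case opcode_W
    ALU                       ⇒ rf3
    ALUi, LW                  ⇒ rf2
    JAL, JALR                 ⇒ R31

RegWrite = Case opcode_W
    ALU, ALUi, LW             ⇒ (destination register ≠ 0)
    JAL, JALR                 ⇒ on
    ...                       ⇒ off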
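The point of the subscripts in the control equations is that each signal is decoded from the instruction currently held in the stage where the signal is used, not from a single global opcode. The sketch below illustrates this for the ALU, ALUi, LW, and SW classes only. It is an assumption-laden illustration: the function name, the dictionary layout, and the use of None for a bubble or a don't-care value are invented for this example, and the check that the destination register is not r0 is left out.

```python
REG_WRITERS = {"ALU", "ALUi", "LW"}


def stage_controls(ir):
    """Decode control signals from the opcode class held in each stage's IR.

    ir maps a stage name ("D", "E", "M", "W") to an opcode class
    ("ALU", "ALUi", "LW", "SW") or None for a bubble. Jumps and
    branches are ignored, as in the slide.
    """
    d, e, m, w = (ir.get(stage) for stage in "DEMW")
    return {
        # Decode stage: operand B is a register only for register-register ALU ops.
        "BSrc": None if d is None else ("Reg" if d == "ALU" else "Imm"),
        # Execute stage: loads and stores use the ALU to add base and offset.
        "OpSel": None if e is None else {"ALU": "Func", "ALUi": "Op"}.get(e, "+"),
        # Memory stage: only stores write data memory.
        "MemWrite": m == "SW",
        # Memory stage: SW writes no register, so its WBSrc is a don't-care (None here).
        "WBSrc": None if m in (None, "SW") else ("Mem" if m == "LW" else "ALU"),
        # Write-back stage: which instruction field names the destination register.
        "RegDst": {"ALU": "rf3", "ALUi": "rf2", "LW": "rf2"}.get(w),
        # Write-back stage: the slide's extra "destination is not r0" check is omitted.
        "RegWrite": w in REG_WRITERS,
    }


# Four different instructions in flight, one per stage.
print(stage_controls({"D": "LW", "E": "ALU", "M": "SW", "W": "ALUi"}))
```

In the example call, the load in decode selects the immediate, the ALU instruction in execute selects Func, the store in the memory stage asserts MemWrite, and the ALUi in write-back writes rf2, all in the same cycle.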

Data Hazards

[Figure: pipelined datapath with the instructions below in adjacent stages]

    ...
    r1 ← (r0) + 10
    r4 ← (r1) + 17
    ...

Oops!

Resolving Data Hazards

1. Freeze earlier pipeline stages until the data becomes available ⇒ interlocks
2. If data is available somewhere in the datapath, provide a bypass to get it to the right stage

Interlocks to resolve Data Hazards

[Figure: pipelined datapath with a stall condition fed back to the earlier stages]

    ...
    r1 ← (r0) + 10
    r4 ← (r1) + 17
    ...

Stalled Stages and Pipeline Bubbles

time                    t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17          IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                              IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                  IF4  ID4  EX4  MA4  WB4
(I5)                                                       IF5  ID5  EX5  MA5  WB5

The repeated ID2 and IF3 entries are the stalled stages.

Resource Usage

time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5

nop ⇒ pipeline bubble
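The stalled timing diagram can be derived mechanically. The sketch below assumes an in-order pipeline with interlocks and no bypassing, in which a dependent instruction stays in ID through the cycle after its producer's write-back, as the diagram shows for I2. The function name and the program encoding are hypothetical, and I3 to I5 are assumed to be independent of earlier instructions because the slide leaves them unspecified.

```python
def schedule(program):
    """Compute the first cycle each instruction spends in IF, ID, EX, MA, WB.

    program is a list of (dest, sources) pairs; dest is None when the
    instruction writes no register. In-order, interlocked, no bypassing.
    """
    rows = []
    for dest, sources in program:
        if rows:
            prev = rows[-1]
            if_start = prev["ID"]                     # IF frees up when the previous instruction enters ID
            id_start = max(if_start + 1, prev["EX"])  # ID frees up when the previous instruction enters EX
        else:
            if_start, id_start = 0, 1
        ex = id_start + 1
        # Interlock: wait in ID until the cycle after each producer's write-back.
        for earlier, (earlier_dest, _) in zip(rows, program):
            if earlier_dest is not None and earlier_dest in sources:
                ex = max(ex, earlier["WB"] + 2)
        rows.append({"IF": if_start, "ID": id_start, "EX": ex, "MA": ex + 1, "WB": ex + 2})
    return rows


program = [
    ("r1", ["r0"]),  # I1: r1 <- (r0) + 10
    ("r4", ["r1"]),  # I2: r4 <- (r1) + 17, depends on I1
    (None, []),      # I3 to I5: assumed independent
    (None, []),
    (None, []),
]
for number, row in enumerate(schedule(program), start=1):
    print(f"I{number}", row)
```

An instruction occupies a stage from its start cycle until it enters the next one, so I2 holding ID from t2 until it enters EX at t6 corresponds to the four ID2 entries in the diagram, and the three idle EX cycles in between are the bubbles.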

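Interlock Control Logic (worksheet)

[Figure: pipelined datapath with a C_stall block taking rf1 and rf2 from the decode stage and the outputs of C_dest blocks from later stages, and producing stall]

Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

Interlock Control Logic, ignoring jumps & branches

[Figure: C_stall takes rf1, rf2, re1, re2 from the decode stage (re1 and re2 from a C_re block) and the destination register and write enable from a C_dest block in each of the E, M, and W stages]

Should we always stall if the rs field matches some rd?
- not every instruction writes registers ⇒ we
- not every instruction reads registers ⇒ re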

Source & Destination Registers

R-type:  op | rf1 | rf2 | rf3 | func
I-type:  op | rf1 | rf2 | immediate16
J-type:  op | immediate26

                                              source(s)   destination
ALU     rf3 ← (rf1) func (rf2)                rf1, rf2    rf3
ALUi    rf2 ← (rf1) op imm                    rf1         rf2
LW      rf2 ← M[(rf1) + imm]                  rf1         rf2
SW      M[(rf1) + imm] ← (rf2)                rf1, rf2
BZ      cond (rf1)                            rf1
          true:  PC ← (PC) + imm
          false: PC ← (PC) + 4
J       PC ← (PC) + imm
JAL     r31 ← (PC), PC ← (PC) + imm                       31
JR      PC ← (rf1)                            rf1
JALR    r31 ← (PC), PC ← (rf1)                rf1         31

Deriving the Stall Signal

C_dest:
ws = Case opcode
    ALU            ⇒ rf3
    ALUi, LW       ⇒ rf2
    JAL, JALR      ⇒ R31

we = Case opcode
    ALU, ALUi, LW  ⇒ (ws ≠ 0)
    JAL, JALR      ⇒ on
    ...            ⇒ off

C_re:
re1 = Case opcode
    ALU, ALUi, ... ⇒ on
    ...            ⇒ off

re2 = Case opcode
    ...            ⇒ on
    ...            ⇒ off

C_stall:
stall = ...

Stall if a source register of the instruction in the decode stage matches the destination register of an uncommitted instruction.

The Stall Signal

C_dest:
ws = Case opcode
    ALU            ⇒ rf3
    ALUi, LW       ⇒ rf2
    JAL, JALR      ⇒ R31

we = Case opcode
    ALU, ALUi, LW  ⇒ (ws ≠ 0)
    JAL, JALR      ⇒ on
    ...            ⇒ off

C_re:
re1 = Case opcode
    ALU, ALUi, LW, SW, BZ, JR, JALR  ⇒ on
    J, JAL                           ⇒ off

re2 = Case opcode
    ALU, SW        ⇒ on
    ...            ⇒ off

C_stall:
stall = ((rf1_D = ws_E).we_E + (rf1_D = ws_M).we_M + (rf1_D = ws_W).we_W).re1_D
      + ((rf2_D = ws_E).we_E + (rf2_D = ws_M).we_M + (rf2_D = ws_W).we_W).re2_D

This is not the full story!

Hazards due to Loads & Stores

[Figure: pipelined datapath with a stall condition fed back to the earlier stages]

    ...
    M[(r1)+7] ← (r2)
    r4 ← M[(r3)+5]
    ...

Is there any possible data hazard in this instruction sequence?
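The stall equation on "The Stall Signal" slide translates directly into code. The sketch below is one possible rendering, not the course's implementation: the function names and the tuple encodings are assumptions, ws stands for the destination register and we for the write enable as in this transcription, and jumps and loads/stores to memory are not treated specially.

```python
def c_dest(op, rf2, rf3):
    """Destination register (ws) and write enable (we) for one in-flight instruction."""
    if op == "ALU":
        ws = rf3
    elif op in ("ALUi", "LW"):
        ws = rf2
    elif op in ("JAL", "JALR"):
        return 31, True
    else:
        return None, False
    return ws, ws != 0  # writes to r0 are not enabled


def c_re(op):
    """Read enables: does the instruction actually read rf1 / rf2?"""
    re1 = op in {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}
    re2 = op in {"ALU", "SW"}
    return re1, re2


def stall(rf1, rf2, re1, re2, in_flight):
    """in_flight holds (ws, we) for the instructions in the E, M, and W stages."""
    match1 = any(we and rf1 == ws for ws, we in in_flight)
    match2 = any(we and rf2 == ws for ws, we in in_flight)
    return (match1 and re1) or (match2 and re2)


# I1: r1 <- (r0) + 10 is in E; I2: r4 <- (r1) + 17 is in decode (rf1 = 1, rf2 = 4).
i1 = c_dest("ALUi", rf2=1, rf3=None)
print(stall(1, 4, *c_re("ALUi"), in_flight=[i1]))  # I2 must wait for r1

# An ALUi whose rf2 field happens to equal 1 only writes rf2, so it need not stall.
print(stall(2, 1, *c_re("ALUi"), in_flight=[i1]))
```

The second call shows why the read enables matter: the register number in the rf2 field matches the pending destination, but an ALUi instruction uses rf2 as its own destination, so re2 is off and no stall is raised.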

Hazards due to Loads & Stores: depends on the memory system?

[Figure: pipelined datapath with the instructions below in adjacent stages]

    M[(r1)+7] ← (r2)
    r4 ← M[(r3)+5]

(r1)+7 = (r3)+5 ⇒ data hazard

However, the hazard is avoided because our memory system completes writes in one cycle!

Complications due to Jumps

[Figure: fetch logic with muxes PCSrc1 (j / ~j) and PCSrc2 (PC / RInd, for register indirect jumps), a stall input, and a "Jump?" test on the instruction in decode that kills the instruction being fetched]

    I1   096   ADD
    I2   100   J +200
    I3   104   ADD      (killed)
    I4   304   ADD

A jump instruction kills (not stalls) the following instruction. How?

Assuming no delay slot.

Pipelining Jumps

[Figure: fetch logic with PCSrc1 (j / ~j), PCSrc2 (PC / RInd), stall, a "Jump?" test in decode, and a mux controlled by IRSrc_D in front of the decode-stage IR]

    I1   096   ADD
    I2   100   J +200
    I3   104   ADD      (killed)
    I4   304   ADD

No delay slot.

Killing the fetched instruction: insert a mux before the IR.

IRSrc_D = Case opcode_D
    J, JAL  ⇒ nop
    ...     ⇒ IM

Any interaction between stall and jump?

Pipelining Conditional Branches

[Figure: the same fetch logic, with the "BEQZ?" and "zero?" tests located in the execute stage]

    I1   096   ADD
    I2   100   BEQZ r1, +200
    I3   104   ADD
    I4   304   ADD

No delay slot.

The branch condition is not known until the execute stage. What action should be taken in the decode stage?

Next lecture...
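The effect of the mux in front of the decode-stage IR on the "Pipelining Jumps" slide can be seen in a small fetch/decode simulation. This is a sketch under stated assumptions: the function name and the instruction encoding are hypothetical, the jump target is taken as the PC of the killed fetch plus the offset (104 + 200 = 304 in the slide's example), and stalls, register-indirect jumps, and conditional branches are left out.

```python
NOP = (None, "NOP", None)  # (address, opcode, immediate)


def run(imem, pc, cycles):
    """Return the addresses of the instructions entering decode, one per cycle.

    A J or JAL sitting in decode kills the instruction being fetched
    (a nop enters decode instead) and redirects the PC.
    """
    ir_d = NOP
    trace = []
    for _ in range(cycles):
        fetched = imem.get(pc, NOP)
        if ir_d[1] in ("J", "JAL"):
            # IRSrc_D selects nop; the PC is redirected to the jump target.
            next_ir, next_pc = NOP, pc + ir_d[2]
        else:
            # IRSrc_D selects the instruction memory output.
            next_ir, next_pc = fetched, pc + 4
        ir_d, pc = next_ir, next_pc
        trace.append(ir_d[0])
    return trace


imem = {
    96: (96, "ADD", None),
    100: (100, "J", 200),
    104: (104, "ADD", None),  # fetched behind the jump, then killed
    304: (304, "ADD", None),
}
print(run(imem, pc=96, cycles=4))
```

The trace shows one bubble where the instruction at address 104 would have entered decode, followed by the instruction at the jump target. A conditional branch cannot be handled this way in decode, since its condition is only known in the execute stage.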