Simple Instruction-Pipelining. Pipelined Harvard Datapath

6.823, L8-1
Simple Instruction Pipelining
Laboratory for Computer Science, M.I.T.
http://www.csg.lcs.mit.edu/6.823

6.823, L8-2
Pipelined Harvard Datapath

Stages: fetch | decode & Reg-fetch | execute | memory | write-back

The clock period can be reduced by dividing the execution of an instruction into multiple cycles:

    t_C > max{t_IM, t_RF, t_ALU, t_DM, t_RW} = t_DM (probably)

However, CPI will increase unless instructions are pipelined.

6.823, L8-3
How to divide the datapath into stages

Suppose memory is significantly slower than the other stages. In particular, suppose

    t_IM = t_DM = 10 units
    t_ALU = 5 units
    t_RF = t_RW = 1 unit

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance.

6.823, L8-4
Minimizing Critical Path

Stages: fetch | decode & Reg-fetch & execute | memory | write-back

    t_C > max{t_IM, t_RF + t_ALU, t_DM, t_RW}

Write-back takes much less time than the other stages. Suppose we combined it with the memory stage: this would increase the critical path by 10% (from 10 to 11 units).

6.823, L8-5
Speedup by Pipelining (ignoring hazards)

For the 4-stage pipeline, given t_IM = t_DM = 10 units, t_ALU = 5 units, t_RF = t_RW = 1 unit:
    t_C could be reduced from 27 units to 10 units => speedup = 2.7

However, if t_IM = t_DM = t_ALU = t_RF = t_RW = 5 units:
    the same 4-stage pipeline can reduce t_C from 25 units to 10 units => speedup = 2.5

But, since t_IM = t_DM = t_ALU = t_RF = t_RW, it is possible to achieve a higher speedup with more stages in the pipeline:
    a 5-stage pipeline can reduce t_C from 25 units to 5 units => speedup = 5

6.823, L8-6
An Ideal Pipeline

stage 1 -> stage 2 -> stage 3 -> stage 4

- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. An instruction pipeline, however, cannot satisfy the last condition. Why?
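The speedup figures above can be checked with a few lines of Python. This is only a sketch: the function names, the dictionary keys and the stage groupings are illustrative assumptions, and pipeline-register overhead and hazards are ignored, as on the slide.

```python
def cycle_time(delays, stages):
    """Clock period = delay of the slowest stage (sum of the units it contains)."""
    return max(sum(delays[unit] for unit in stage) for stage in stages)


def speedup(delays, stages):
    """Unpipelined period (sum of all unit delays) over the pipelined period."""
    return sum(delays.values()) / cycle_time(delays, stages)


# Stage groupings (assumed names): the 4-stage pipeline merges RF and ALU.
FOUR_STAGE = [("IM",), ("RF", "ALU"), ("DM",), ("RW",)]
FIVE_STAGE = [("IM",), ("RF",), ("ALU",), ("DM",), ("RW",)]

slow_memory = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}
uniform = dict.fromkeys(slow_memory, 5)

print(speedup(slow_memory, FOUR_STAGE))  # 2.7
print(speedup(uniform, FOUR_STAGE))      # 2.5
print(speedup(uniform, FIVE_STAGE))      # 5.0
```

Merging write-back into the memory stage corresponds to the grouping `[("IM",), ("RF", "ALU"), ("DM", "RW")]`, which gives a cycle time of 11 units for the slow-memory delays.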

6.823, L8-7
How Instructions Can Interact with Each Other in a Pipeline

- An instruction in the pipeline may need a resource being used by another instruction in the pipeline => structural hazard
- An instruction may produce data that is needed by a later instruction => data hazard
- In the extreme case, an instruction may determine the next instruction to be executed => control hazard (branches, interrupts, ...)

6.823, L8-8
Feedback to Resolve Hazards

[Figure: stages 1 to 4 with feedback paths FB1 to FB4]

Controlling the pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur).

Feedback to previous stages is used to stall or kill instructions.

6.823, L8-9
Technology Assumptions

We will assume:
- A small amount of very fast memory (caches) backed up by a large, slower memory
- A fast ALU (at least for integers)
- Multiported register files (slower!)

This makes the following timing assumption valid:

    t_IM ≈ t_RF ≈ t_ALU ≈ t_DM ≈ t_RW

A 5-stage pipelined Harvard architecture will be the focus of our detailed design.

6.823, L8-10
5-Stage Pipelined Execution

Stages: fetch (IF) | decode & Reg-fetch (ID) | execute (EX) | memory (MA) | write-back (WB)

time           t0   t1   t2   t3   t4   t5   t6   t7   ...
instruction1   IF1  ID1  EX1  MA1  WB1
instruction2        IF2  ID2  EX2  MA2  WB2
instruction3             IF3  ID3  EX3  MA3  WB3
instruction4                  IF4  ID4  EX4  MA4  WB4
instruction5                       IF5  ID5  EX5  MA5  WB5
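The diagram above follows a simple rule: with no hazards, instruction i occupies stage s at time (i - 1) + s. A minimal Python sketch that regenerates it is shown here; `STAGES` and `pipeline_diagram` are names chosen for this illustration only.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def pipeline_diagram(n_instructions):
    """Ideal (hazard-free) pipeline diagram.

    Row i, column t holds the stage occupied by instruction i+1 at time t,
    or an empty string if that instruction is not in the pipeline at t.
    """
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name}{i + 1}"
        rows.append(row)
    return rows


# Print one row per instruction, one fixed-width column per cycle.
for number, row in enumerate(pipeline_diagram(5), start=1):
    print(f"instruction{number}  " + " ".join(f"{cell:4}" for cell in row))
```

Reading the same table column by column gives the resource usage view: at time t, stage s holds instruction t - s + 1.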

6.823, L8-11
5-Stage Pipelined Execution: Resource Usage Diagram

Stages: fetch (IF) | decode & Reg-fetch (ID) | execute (EX) | memory (MA) | write-back (WB)

Resources  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF         I1   I2   I3   I4   I5
ID              I1   I2   I3   I4   I5
EX                   I1   I2   I3   I4   I5
MA                        I1   I2   I3   I4   I5
WB                             I1   I2   I3   I4   I5

6.823, L8-12
Pipelined Execution: ALU Instructions

[Figure: pipelined datapath for ALU instructions, annotated "not quite correct!"]

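6.823, L8-13
Pipelined Execution: Need for Several IRs

[Figure: pipelined datapath with a separate IR for each stage]

6.823, L8-14
IRs and Control Points

[Figure: pipelined datapath with the IRs driving the control points]

Are the control points connected properly?
- Load/Store instructions
- ALU instructions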

6.823, L8-15
Pipelined Harvard Datapath without interlocks and jumps

[Figure: pipelined datapath with one IR per stage and the control signals ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite]

6.823, L8-16
Hardwired Control Equations: Harvard Datapath (pipelined)

ExtSel   = Case opcode_D
             ALUi, LW, SW, BEQZ, BNEZ  => sExt16
             ALUiu                     => uExt16
             J, JAL                    => sExt26

BSrc     = Case opcode_D
             ALU                       => Reg
             ALUi, LW, SW              => Imm

OpSel    = Case opcode_E
             ALU                       => Func
             ALUi                      => Op
             LW, SW                    => +
             BEQZ, BNEZ                => 0?

MemWrite = Case opcode_M
             SW                        => on
             ...                       => off

WBSrc    = Case opcode_M
             ALU, ALUi                 => ALU
             LW                        => Mem
             JAL, JALR                 => PC

RegDst   = Case opcode_W
             ALU                       => rf3
             ALUi, LW                  => rf2
             JAL, JALR                 => R31

RegWrite = Case opcode_W
             ALU, ALUi, LW, JAL, JALR  => on
             ...                       => off

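6.823, L8-17
Data Hazards

[Figure: pipelined datapath with IRs for the D, E, M and W stages]

    ...
    r1 <- (r0) + 10
    r4 <- (r1) + 17
    ...

Oops! The second instruction reads r1 before the first has written it back.

6.823, L8-18
Resolving Data Hazards

1. Freeze earlier pipeline stages until the data becomes available => interlocks
2. If the data is available somewhere in the datapath, provide a bypass to get it to the right stage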

6.823, L8-19
Interlocks to Resolve Data Hazards

[Figure: pipelined datapath with a Stall Condition block; a nop is inserted into the IR of the execute stage while the decode stage is stalled]

    ...
    r1 <- (r0) + 10
    r4 <- (r1) + 17
    ...

6.823, L8-20
Stalled Stages and Pipeline Bubbles

time                  t0   t1   t2   t3   t4   t5   t6   t7   ...
(I1) r1 <- (r0) + 10  IF1  ID1  EX1  MA1  WB1
(I2) r4 <- (r1) + 17       IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                IF4  ID4  EX4  MA4  WB4
(I5)                                                     IF5  ID5  EX5  MA5  WB5

The repeated ID2 and IF3 entries are the stalled stages.

Resource usage:

Resources  t0   t1   t2   t3   t4   t5   t6   t7   ...
IF         I1   I2   I3   I3   I3   I3   I4   I5
ID              I1   I2   I2   I2   I2   I3   I4   I5
EX                   I1   nop  nop  nop  I2   I3   I4   I5
MA                        I1   nop  nop  nop  I2   I3   I4   I5
WB                             I1   nop  nop  nop  I2   I3   I4   I5

nop => pipeline bubble
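The stall timing in the diagram can be reproduced with a small scheduling sketch. It assumes, as the diagram does, that there is no bypassing and that a dependent instruction waits in ID until the producer has left WB. The function name and the `(dest, sources)` tuple encoding are hypothetical, and r0 is not treated specially here.

```python
def ex_entry_cycles(program):
    """Cycle in which each instruction enters EX, with interlocks only.

    program: list of (dest, sources) pairs of register numbers; dest may be
    None for instructions that write no register. Issue is in order.
    """
    ex = []
    for i, (_dest, sources) in enumerate(program):
        # Last cycle spent in ID. Without a hazard, an instruction is in ID
        # during the cycle the previous instruction enters EX (t1 for the first).
        last_id = ex[i - 1] if i > 0 else 1
        for j in range(i):
            producer_dest = program[j][0]
            if producer_dest is not None and producer_dest in sources:
                # The producer is in EX, MA, WB during cycles ex[j] .. ex[j]+2,
                # so the consumer stays in ID until cycle ex[j]+3.
                last_id = max(last_id, ex[j] + 3)
        ex.append(last_id + 1)
    return ex


program = [
    (1, (0,)),    # I1: r1 <- (r0) + 10
    (4, (1,)),    # I2: r4 <- (r1) + 17, depends on I1
    (None, ()),   # I3: independent of I1 and I2
]
print(ex_entry_cycles(program))  # [2, 6, 7]
```

I2 enters EX at t6 instead of t3, which matches the three nop bubbles in the resource usage table.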

6.823, L8-21
Interlock Control Logic worksheet

[Figure: interlock control logic. A C_stall block compares rf1 and rf2 of the instruction in decode with the C_dest outputs for the E, M and W stages; a stall inserts a nop.]

Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.

6.823, L8-22
Interlock Control Logic (ignoring jumps & branches)

[Figure: as above, with C_dest producing ws and we for the E, M and W stages, and C_re producing re1 and re2 for the decode stage]

Should we always stall if the rs field matches some rd?
- not every instruction writes registers => we
- not every instruction reads registers => re

6.823, L8-23
Source & Destination Registers

R-type:  op  rf1  rf2  rf3  func
I-type:  op  rf1  rf2  immediate16
J-type:  op  immediate26

                                          source(s)   destination
ALU    rf3 <- (rf1) func (rf2)            rf1, rf2    rf3
ALUi   rf2 <- (rf1) op imm                rf1         rf2
LW     rf2 <- M[(rf1) + imm]              rf1         rf2
SW     M[(rf1) + imm] <- (rf2)            rf1, rf2
BZ     cond (rf1)                         rf1
         true:  PC <- (PC) + imm
         false: PC <- (PC) + 4
J      PC <- (PC) + imm
JAL    r31 <- (PC), PC <- (PC) + imm                  31
JR     PC <- (rf1)                        rf1
JALR   r31 <- (PC), PC <- (rf1)           rf1         31

6.823, L8-24
Deriving the Stall Signal

C_dest:
    ws  = Case opcode
            ALU                       => rf3
            ALUi, LW                  => rf2
            JAL, JALR                 => R31
    we  = Case opcode
            ALU, ALUi, LW, JAL, JALR  => on
            ...                       => off

C_re:
    re1 = Case opcode
            ALU, ALUi, ...            => on
            ...                       => off
    re2 = Case opcode
            ...                       => on
            ...                       => off

C_stall:
    stall = ...

Stall if a source register of the instruction in the decode stage matches the destination register of an uncommitted instruction.

6.823, L8-25
The Stall Signal

C_dest:
    ws  = Case opcode
            ALU                       => rf3
            ALUi, LW                  => rf2
            JAL, JALR                 => R31
    we  = Case opcode
            ALU, ALUi, LW, JAL, JALR  => (ws != 0)
            ...                       => off

C_re:
    re1 = Case opcode
            ALU, ALUi, LW, SW, BZ, JR, JALR  => on
            J, JAL                           => off
    re2 = Case opcode
            ALU, SW                   => on
            ...                       => off

C_stall:
    stall = ((rf1_D = ws_E).we_E + (rf1_D = ws_M).we_M + (rf1_D = ws_W).we_W).re1_D
          + ((rf2_D = ws_E).we_E + (rf2_D = ws_M).we_M + (rf2_D = ws_W).we_W).re2_D

This is not the full story!

6.823, L8-26
Hazards due to Loads & Stores

[Figure: pipelined datapath with the Stall Condition block]

    ...
    M[(r1)+7] <- (r2)
    r4 <- M[(r3)+5]
    ...

Is there any possible data hazard in this instruction sequence?
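The stall equation can be read as a small piece of combinational logic. The Python sketch below is one way to write it down, not the lecture's own code: `ws` stands for the write-register specifier and `we` for the write enable of an in-flight instruction, `re1`/`re2` are the read enables of the instruction in decode, the opcode classes are plain strings, and the use of register 31 as the link register for JAL/JALR is an assumption. Instructions are hypothetical `(opcode, rf1, rf2, rf3)` tuples, with unused fields set to 0.

```python
LINK_REGISTER = 31  # assumed destination of JAL / JALR


def ws_we(instr):
    """C_dest: destination register (ws) and write enable (we)."""
    opcode, _rf1, rf2, rf3 = instr
    if opcode == "ALU":
        ws = rf3
    elif opcode in ("ALUi", "LW"):
        ws = rf2
    elif opcode in ("JAL", "JALR"):
        ws = LINK_REGISTER
    else:
        return None, False       # SW, BZ, J, JR write no register
    return ws, ws != 0           # writes to r0 never cause a stall


def re1_re2(opcode):
    """C_re: does the instruction read rf1 / rf2?"""
    re1 = opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")
    re2 = opcode in ("ALU", "SW")
    return re1, re2


def stall(decode, in_flight):
    """C_stall: decode is the instruction in ID; in_flight holds the
    instructions in EX, MA and WB (None for a bubble)."""
    opcode, rf1, rf2, _rf3 = decode
    re1, re2 = re1_re2(opcode)
    for instr in in_flight:
        if instr is None:
            continue
        ws, we = ws_we(instr)
        if we and ((re1 and rf1 == ws) or (re2 and rf2 == ws)):
            return True
    return False


i1 = ("ALUi", 0, 1, 0)   # r1 <- (r0) + 10
i2 = ("ALUi", 1, 4, 0)   # r4 <- (r1) + 17
print(stall(i2, [i1, None, None]))    # True: I1 in EX still has to write r1
print(stall(i2, [None, None, None]))  # False: nothing uncommitted
```

As the slide notes, this covers register dependences only; it says nothing about hazards through memory or about jumps.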

6.823, L8-27
Hazards due to Loads & Stores: depends on the memory system?

[Figure: pipelined datapath with the Stall Condition block]

    M[(r1)+7] <- (r2)
    r4 <- M[(r3)+5]

    (r1)+7 = (r3)+5  =>  data hazard

However, the hazard is avoided because our memory system completes writes in one cycle!

6.823, L8-28
Complications due to Jumps

[Figure: fetch path with the PC, the stall signal, and a nop that can be inserted into the IR]

Assuming no delay slot:

    I1   096   ADD
    I2   100   J 200
    I3   104   ADD     <- kill
    I4   304   ADD