Pipelining. CS 365 Lecture 12. Prof. Yih Huang.

Traditional Execution
[Figure: add (cycles 1-4), ld (cycles 1-5), and beq (cycles 1-3) executed one after another; each instruction starts only after the previous one has completed.]

Pipelined Execution
[Figure: seven instructions, each taking cycles 1-5, with their executions overlapped.]

Basic Ideas
Do not wait for an instruction to complete before starting the next.
Start Cycle 0 of the next instruction when the previous one enters Cycle 1.
Instruction executions are overlapped.
Pipelining increases instruction throughput, as opposed to decreasing the execution time of individual instructions.
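The throughput claim can be checked with a little arithmetic. The Python sketch below assumes an idealized case that the slides do not state explicitly: every instruction passes through all five stages and nothing ever stalls. The function names are made up for illustration.

```python
def sequential_cycles(n, stages=5):
    """Cycles needed when each instruction finishes before the next starts."""
    return n * stages


def pipelined_cycles(n, stages=5):
    """Cycles needed when a new instruction starts every cycle (no stalls)."""
    if n == 0:
        return 0
    # The first instruction needs `stages` cycles; each later one adds one.
    return stages + (n - 1)


n = 7  # the pipelined-execution figure shows seven instructions
print(sequential_cycles(n), pipelined_cycles(n))  # prints: 35 11
```

Each instruction still takes five cycles from start to finish; only the rate at which instructions complete improves.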

Easier Said Than Done?
In every cycle, the activities of all five stages take place.
Many problems arise with overlapped execution:
    Structural hazards
    Control hazards
    Data hazards

Pipeline Hazards I
Structural hazards: the datapath cannot support the combination of instructions that we want to execute in the same cycle.
Consider what happens when an R-type instruction is followed by a BEQ.
[Figure: stage timelines of an add (RR, +, RW) and the beq that follows it.]

Summary of MIPS Lite Instruction Executions
Step 1, instruction fetch (all instructions):
    IR = Memory[PC]
    PC = PC + 4
Step 2, instruction decode/register fetch (all instructions):
    A = Reg[IR[25-21]]
    B = Reg[IR[20-16]]
    ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Step 3, execution, address computation, branch/jump completion:
    R-type:            ALUOut = A op B
    Memory reference:  ALUOut = A + sign-extend(IR[15-0])
    Branch:            if (A == B) then PC = ALUOut
    Jump:              PC = PC[31-28] || (IR[25-0] << 2)
Step 4, memory access or R-type completion:
    R-type:            Reg[IR[15-11]] = ALUOut
    Load:              MDR = Memory[ALUOut]
    Store:             Memory[ALUOut] = B
Step 5, memory read completion:
    Load:              Reg[IR[20-16]] = MDR

Resource Conflicts on the Multicycle Datapath
[Figure: the multicycle datapath, with one memory shared by instructions and data, one ALU, the instruction register, the memory data register, and the control signals IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, and MemtoReg.]
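One rough way to see the resource conflicts is to record which shared unit each step uses and then overlap instructions by one cycle. The Python sketch below is an illustration only: the table `STEPS`, the function `conflicts`, and the decision to track just the single memory ("MEM") and the single ALU are assumptions read off the step summary, not part of the lecture.

```python
from collections import defaultdict

# Shared units used in each step, following the step summary.
# Steps 1-3 use the ALU (PC + 4, branch target, then the operation itself);
# step 1 and the memory-access step use the single memory.
STEPS = {
    "rtype":  [{"MEM", "ALU"}, {"ALU"}, {"ALU"}, set()],
    "load":   [{"MEM", "ALU"}, {"ALU"}, {"ALU"}, {"MEM"}, set()],
    "store":  [{"MEM", "ALU"}, {"ALU"}, {"ALU"}, {"MEM"}],
    "branch": [{"MEM", "ALU"}, {"ALU"}, {"ALU"}],
    "jump":   [{"MEM", "ALU"}, {"ALU"}, set()],
}


def conflicts(kinds):
    """Start instruction i in cycle i and list every (cycle, unit, users)
    where more than one instruction needs the same unit."""
    users = defaultdict(list)
    for i, kind in enumerate(kinds):
        for step, units in enumerate(STEPS[kind]):
            for unit in units:
                users[(i + step, unit)].append(i)
    return sorted((cycle, unit, tuple(who))
                  for (cycle, unit), who in users.items() if len(who) > 1)


# An R-type instruction followed by a branch: both want the ALU.
print(conflicts(["rtype", "branch"]))
```

Under these assumptions every pair of adjacent instructions collides on the ALU, which is one way to motivate a redesigned datapath with extra adders and separate memories.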

Pipeline Hazards II
Control hazards: when we decide to branch, other instructions are already in the pipeline.
[Figure: beq is followed in the pipeline by add (RR, +, RW) and sub (RR, +, RW); the branch target is ld (RR, +, MR, RW). Which one is next?]

Pipeline Hazards III
Data hazards (data dependencies): an instruction depends on the result of a previous instruction that is still in the pipeline.
    Add r1, r2, r3     (RR, +, RW)   writes the new value of r1 in RW
    Sub r4, r1, r10    (RR, ...)     reads r1 in RR, when the new value is not available yet
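Data dependencies of this kind can be found mechanically. The Python sketch below scans a program for read-after-write pairs that are close enough together to matter. The tuple encoding `(dest, sources)`, the function name `raw_hazards`, and the default `window` of two instructions are assumptions for illustration; how many following instructions are actually at risk depends on the pipeline design.

```python
def raw_hazards(program, window=2):
    """program: list of (dest, sources) tuples, dest may be None.
    Returns (producer_index, consumer_index, register) for every source
    register written by one of the `window` preceding instructions."""
    hazards = []
    for j, (_, sources) in enumerate(program):
        for reg in dict.fromkeys(sources):      # each source once, in order
            # Look backwards for the nearest instruction that writes reg.
            for i in range(j - 1, max(j - window, 0) - 1, -1):
                if program[i][0] == reg:
                    hazards.append((i, j, reg))
                    break
    return hazards


program = [
    ("r1", ("r2", "r3")),    # Add r1, r2, r3
    ("r4", ("r1", "r10")),   # Sub r4, r1, r10
]
print(raw_hazards(program))  # prints: [(0, 1, 'r1')]
```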

Lessons
To achieve pipelining and avoid hazards, we need to redesign the datapath and the instruction execution steps.

Pipeline Stages
Stage 1 (all instructions):
    IR <- Mem[PC]
    PC <- PC + 4
Stage 2, RR (all instructions):
    Read Reg[rs] and Reg[rt]
Stage 3, EX (use ALU):
    R-type:  RS op RT
    LD:      Calculate RS + Immd
    ST:      Calculate RS + Immd
    BEQ:     Calculate PC + Immd; compare RS and RT
    J:       Set PC to Immd
Stage 4, DM:
    LD:      Read memory
    ST:      Write memory
    BEQ:     Set PC accordingly
Stage 5, RW/WB:
    R-type:  Write to Rd
    LD:      Write to Rt

Graphically Representing Pipelines
[Figure: time (in clock cycles CC1-CC6) runs across; program execution order (in instructions) runs down.]
    lw  $10, 20($1)     IM   Reg  ALU  DM   Reg
    sub $11, $2, $3          IM   Reg  ALU  DM   Reg

Solving Structural Hazards: Pipelined Datapath
[Figure: the pipelined datapath, with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; a separate instruction memory and data memory; an adder for PC + 4; a second adder with a shift-left-2 unit for the branch target; the register file; a 16-to-32-bit sign extender; and the ALU.]
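Diagrams of this kind are easy to generate. The Python sketch below prints one row per instruction and shifts each row one clock cycle to the right, assuming no stalls. The stage labels follow the figure; the function name `pipeline_diagram` and the column layout are illustrative choices.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def pipeline_diagram(instructions):
    """Return a text pipeline diagram: instruction i enters IM in cycle i + 1."""
    n_cycles = len(instructions) + len(STAGES) - 1
    width = max(len(text) for text in instructions)
    header = " " * width + "".join(f" CC{c + 1:<3}" for c in range(n_cycles))
    lines = [header]
    for i, text in enumerate(instructions):
        # Empty cells before the instruction starts and after it finishes.
        cells = [""] * i + STAGES + [""] * (n_cycles - i - len(STAGES))
        lines.append(text.ljust(width) + "".join(f" {cell:<5}" for cell in cells))
    return "\n".join(lines)


print(pipeline_diagram(["lw $10, 20($1)", "sub $11, $2, $3"]))
```

With two instructions and five stages the diagram spans six clock cycles, matching CC1 to CC6 in the figure.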

Discussions
Make sure you understand why two extra adders are added.
Is the assumption of using separate instruction and data memory reasonable?

Consider Control Hazard
[Figure: beq in the pipeline, with the branch target shown below it.]
When is the decision made?

Solution 1: Stalling the Pipeline
[Figure: pipeline diagram of a stalled branch, marked "OR" and "Decision is in".]

Solution 2: Branch Prediction
Make a guess about the branch decision and start executing the guessed path before the decision is in (aka speculative execution).
If the guess was wrong, abandon the instructions in the pipeline and jump to the right target.
Branch prediction strategies:
    Predict branch fails
    Predict branch succeeds
    Look into history

Predict Branch Fails
We guess that the branch will fail; that is, the pipeline goes on to fetch the next sequential instruction.
If the guess is right, just proceed.
If the guess is wrong, abandon the sequential instructions and fetch the instruction at the target address.

Predict Branch Fails: Guess Is Right
[Figure: beq and the following instructions; the decision is in: do not branch.]

Predict Branch Fails: Guess Is Wrong
[Figure: beq, the following instructions (RR, EX), and the target instructions (RR); the decision is in: do branch.]

Discussion
Notice that programmers are not aware of branch predictions; right or wrong guesses affect only performance.
Delayed branches: make it official that the CPU always executes the instruction following a branch. The branch determines the "next next" instruction.
Notice that, in this case, the programmer must be aware of the behavior.

Example
        BEQ r1, r2, target
        add r10, r11, r12      (delay slot, executed regardless of the branch decision)
        add r20, r21, r22
        sub r30, r10, r20
Target: sub r30, r10, r20

Delayed Branch with Wrong Guess
[Figure: beq, the following instructions, and the target instructions; the decision is in: do branch; the delay slot is always finished.]

Discussions
Compilers/programmers must be smart enough to make good use of the delay slots.
The problem is not entirely solved: we still need to stall before the decision comes in.
The situation is exacerbated by deeper pipelining.
Branch predictions are still important.

Smart Branch Predictions
Observations:
    Branches of if-else statements are hard to predict.
    Branches of loops typically repeat previous decisions.
    For performance, loops are more important than other control structures.
Many modern processors use special-purpose hardware to remember the targets of recent branch instructions.
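The observation that loop branches repeat their previous decisions can be tried out with the simplest history-based scheme: predict whatever the branch did last time. The Python sketch below is only one possible reading of "look into history", not the hardware the lecture refers to; the function name and the outcome sequences are made up for illustration.

```python
def last_outcome_hits(outcomes, initial=False):
    """Count correct guesses when each prediction repeats the last outcome.

    outcomes: list of booleans, True meaning the branch was taken.
    initial: the guess used before any history exists (assumed not taken).
    """
    prediction, hits = initial, 0
    for taken in outcomes:
        if prediction == taken:
            hits += 1
        prediction = taken  # remember the most recent decision
    return hits


loop_branch = [True] * 9 + [False]   # a loop that runs ten iterations
alternating = [True, False] * 5      # a branch that flips every time

print(last_outcome_hits(loop_branch), "of", len(loop_branch))  # 8 of 10
print(last_outcome_hits(alternating), "of", len(alternating))  # 0 of 10
```

The loop branch is mispredicted only on entry and on exit, while the alternating pattern defeats this particular scheme completely.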

Recall Data Hazards
    Sub $2, $1, $3
    And $12, $2, $5
    Or  $13, $6, $2
    Add $14, $2, $2
    Sw  $15, 100($2)
[Figure: pipeline diagram marking the point at which $2 becomes available.]

Solution 1: Stalling
Postpone subsequent instructions until the data is available.
Simple but inefficient.
[Figure: pipeline diagram of Sub, And, Or, Add, and Sw, with the dependent instructions postponed.]
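The cost of stalling can be estimated with a small timing model. The Python sketch below assumes the five stages used in this lecture, with RR as the second stage and RW as the fifth, and it assumes that a register written in RW can be read by an RR in the same cycle, so a consumer must start at least three cycles after its producer. That `gap` value, the tuple encoding, and the function name are assumptions, not statements from the slides.

```python
def start_cycles(program, gap=3):
    """program: list of (dest, sources) tuples, dest may be None.
    Returns the effective start cycle of each instruction when dependent
    instructions are postponed until their operands have been written back."""
    starts = []
    written_at = {}  # register -> start cycle of its most recent producer
    for dest, sources in program:
        cycle = starts[-1] + 1 if starts else 0   # in order, one per cycle
        for reg in sources:
            if reg in written_at:
                cycle = max(cycle, written_at[reg] + gap)
        starts.append(cycle)
        if dest is not None:
            written_at[dest] = cycle
    return starts


program = [
    ("$2", ("$1", "$3")),    # Sub $2, $1, $3
    ("$12", ("$2", "$5")),   # And $12, $2, $5
    ("$13", ("$6", "$2")),   # Or  $13, $6, $2
    ("$14", ("$2", "$2")),   # Add $14, $2, $2
    (None, ("$15", "$2")),   # Sw  $15, 100($2)
]
starts = start_cycles(program)
print(starts)                               # prints: [0, 3, 4, 5, 6]
print(starts[-1] - (len(program) - 1))      # stall cycles, prints: 2
```

Under these assumptions only And has to wait; by the time Or, Add, and Sw reach RR, the new value of $2 has been written.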

Solution 2: Internal Forwarding
The new value of $2 is available after the EX stage of Sub, but it is not in $2 yet.
Use special circuits to forward the new value to subsequent instructions.
    Sub r2, r1, r3
    And r12, r2, r5
    Or  r13, r6, r2
    Add r14, r2, r2
    Sw  r15, 100(r2)
[Figure: pipeline diagram of the five instructions, with the result of Sub forwarded to the instructions that follow it.]

Types of Forwarding
ALU forwarding: forward an ALU output to subsequent instructions. This is the case we have seen.
Memory forwarding: forward a memory output (ld result) to subsequent instructions.

Memory Forwarding
    Ld  $1, 4($2)
    Or  $10, $1, $3
    Sub $20, $1, $10
[Figure: pipeline diagram of Ld, Or, and Sub, with the loaded value forwarded to Or.]
Notice that we still have to stall for one cycle.

Delayed Loads
Make it official in the ISA that the result of a load will not be available to the next instruction.
Have the compiler find something useful to do in the load slots (by reordering).
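The one-cycle stall after a load can be reproduced with a small timing model. The Python sketch below assumes forwarding into the EX stage: an ALU result can be used by the very next instruction, while a loaded value only exists after the DM stage, so its consumer must start at least two cycles after the load. The tuple encoding, the function name, and the independent `Add $7, $8, $9` used to fill the load slot are hypothetical.

```python
def stalls_with_forwarding(program):
    """program: list of (op, dest, sources) tuples, dest may be None.
    Returns the number of stall cycles, assuming forwarding into EX."""
    starts = []
    usable_from = {}  # register -> earliest start cycle for a consumer
    for op, dest, sources in program:
        cycle = starts[-1] + 1 if starts else 0
        for reg in sources:
            cycle = max(cycle, usable_from.get(reg, 0))
        starts.append(cycle)
        if dest is not None:
            # A load result is forwarded one stage later than an ALU result.
            usable_from[dest] = cycle + (2 if op == "ld" else 1)
    return starts[-1] - (len(program) - 1) if program else 0


load_use = [
    ("ld", "$1", ("$2",)),          # Ld  $1, 4($2)
    ("or", "$10", ("$1", "$3")),    # Or  $10, $1, $3
    ("sub", "$20", ("$1", "$10")),  # Sub $20, $1, $10
]
print(stalls_with_forwarding(load_use))  # prints: 1

# Filling the load slot with an independent instruction removes the stall.
filled = [load_use[0], ("add", "$7", ("$8", "$9"))] + load_use[1:]
print(stalls_with_forwarding(filled))    # prints: 0
```

This is the reasoning behind delayed loads: if the compiler can always place something useful in the load slot, the hardware never has to insert the bubble.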

MIPS Solution
Delayed result with internal forwarding.
[Figure: pipeline diagram of Ld, a nop or something useful in the load slot, then Or and Sub.]

Summary
Three types of pipeline hazards:
    Structural hazards
    Control hazards
    Data hazards
Compilers/programmers could reorder the code to avoid hazards and eliminate bubbles.
Ideal case: no bubbles; one instruction per cycle.
Stalling (bubbles) is the last resort but must be supported by hardware.

Exercise: Code Reordering
    for (i=0; i<n; i++) z[i] = x[i] + y[i];

Registers
    r1 points to x[i]
    r2 points to y[i]
    r3 points to z[i]
    r4 holds x[i]
    r5 holds y[i]
    r6 holds z[i]
    r7 holds i
    r8 holds N

loop: lw   r4, 0(r1)
      lw   r5, 0(r2)
      add  r6, r4, r5
      sw   r6, 0(r3)
      addi r1, r1, 4
      addi r2, r2, 4
      addi r3, r3, 4
      addi r7, r7, 1
      bne  r7, r8, loop

Loop Unrolling
loop: lw   r4, 0(r1)
      lw   r5, 0(r2)
      add  r6, r4, r5
      sw   r6, 0(r3)
      addi r7, r7, 1
      beq  r7, r8, exit
      lw   r4, 4(r1)
      lw   r5, 4(r2)
      add  r6, r4, r5
      sw   r6, 4(r3)
      addi r1, r1, 8
      addi r2, r2, 8
      addi r3, r3, 8
      addi r7, r7, 1
      bne  r7, r8, loop
exit:
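To check that the unrolled loop computes the same result as the original, its control flow can be mirrored in a high-level language. The Python sketch below follows the assembly step by step, including the exit test in the middle of the body; like the assembly, it assumes at least one element. The function names and the variable `base`, which stands in for the pointer registers r1 to r3, are illustrative.

```python
def add_rolled(x, y):
    """One element per iteration, like the original loop."""
    return [a + b for a, b in zip(x, y)]


def add_unrolled(x, y):
    """Two elements per iteration, mirroring the unrolled assembly.
    Assumes len(x) == len(y) >= 1, as the assembly does."""
    n = len(x)            # r8
    z = [0] * n
    base = 0              # offset shared by r1, r2, r3
    i = 0                 # r7
    while True:
        z[base] = x[base] + y[base]              # first copy of the body
        i += 1
        if i == n:                               # beq r7, r8, exit
            break
        z[base + 1] = x[base + 1] + y[base + 1]  # second copy, offset 4
        base += 2                                # addi r1/r2/r3 by 8
        i += 1
        if i == n:                               # bne r7, r8, loop falls through
            break
    return z


x, y = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
print(add_unrolled(x, y) == add_rolled(x, y))  # prints: True
```

The pointer updates run once per pair of elements instead of once per element, which is where the unrolled version saves instructions.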