EE 660: Computer Architecture Out-of-Order Processors

Similar documents
CSCI-564 Advanced Computer Architecture

Computer Architecture ELEC2401 & ELEC3441

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide)


This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

CPSC 3300 Spring 2017 Exam 2

Simple Instruction-Pipelining. Pipelined Harvard Datapath

4. (3) What do we mean when we say something is an N-operand machine?

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

3. (2) What is the difference between fixed and hybrid instructions?

Simple Instruction-Pipelining. Pipelined Harvard Datapath

CMP N 301 Computer Architecture. Appendix C

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.

[2] Predicting the direction of a branch is not enough. What else is necessary?

[2] Predicting the direction of a branch is not enough. What else is necessary?

Computer Architecture. ESE 345 Computer Architecture. Design Process. CA: Design process

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)

ICS 233 Computer Architecture & Assembly Language

/ : Computer Architecture and Design

Fall 2011 Prof. Hyesoon Kim

Project Two RISC Processor Implementation ECE 485

Portland State University ECE 587/687. Branch Prediction

CS 152 Computer Architecture and Engineering. Lecture 17: Synchronization and Sequential Consistency

Implementing the Controller. Harvard-Style Datapath for DLX

Unit 6: Branch Prediction

Pipelining. Traditional Execution. CS 365 Lecture 12 Prof. Yih Huang. add ld beq CS CS 365 2

Simple Instruction-Pipelining (cont.) Pipelining Jumps

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

CMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle

CS 152 Computer Architecture and Engineering. Lecture 17: Synchroniza<on and Sequen<al Consistency. Last Time, Lecture 16: GPUs. NOW Handout Page 1

Microprocessor Power Analysis by Labeled Simulation

Lecture: Pipelining Basics

Performance Prediction Models

Compiling Techniques

Scalable Store-Load Forwarding via Store Queue Index Prediction

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

Lecture 12: Pipelined Implementations: Control Hazards and Resolutions

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

UNIVERSITY OF WISCONSIN MADISON

Department of Electrical and Computer Engineering The University of Texas at Austin

Branch Prediction using Advanced Neural Methods

Lecture 3, Performance

Lecture 3, Performance

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 (2) Lecture notes from MKP, H. H. Lee and S.

Computer Architecture

Performance, Power & Energy

Mothers of Pipelines

ENEE350 Lecture Notes-Weeks 14 and 15

System Data Bus (8-bit) Data Buffer. Internal Data Bus (8-bit) 8-bit register (R) 3-bit address 16-bit register pair (P) 2-bit address

ECE290 Fall 2012 Lecture 22. Dr. Zbigniew Kalbarczyk

SISD SIMD. Flynn s Classification 8/8/2016. CS528 Parallel Architecture Classification & Single Core Architecture C P M

Circuit Modeling for Practical Many-core Architecture Design Exploration

Processor Design & ALU Design

Arithmetic and Logic Unit First Part

Building a Computer. Quiz #2 on 10/31, open book and notes. (This is the last lecture covered) I wonder where this goes? L16- Building a Computer 1

Control. Control. the ALU. ALU control signals 11/4/14. Next: control. We built the instrument. Now we read music and play it...

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

ALU A functional unit

CSE Computer Architecture I

Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator

CS 700: Quantitative Methods & Experimental Design in Computer Science

Evaluating Overheads of Multi-bit Soft Error Protection Techniques at Hardware Level Sponsored by SRC and Freescale under SRC task number 2042

Pipeline no Prediction. Branch Delay Slots A. From before branch B. From branch target C. From fall through. Branch Prediction

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application

A Second Datapath Example YH16

Clock-driven scheduling

CMP 338: Third Class

Verilog HDL:Digital Design and Modeling. Chapter 11. Additional Design Examples. Additional Figures

CODE GENERATION REGISTER ALLOCATION. Goal. Interplay between. Translate intermediate code into target code

Professor Lee, Yong Surk. References. Topics Microprocessor & microcontroller. High Performance Microprocessor Architecture Overview

Implementing Absolute Addressing in a Motorola Processor (DRAFT)

Introduction to CMOS VLSI Design Lecture 1: Introduction

L07-L09 recap: Fundamental lesson(s)!

Spiral 1 / Unit 3

COVER SHEET: Problem#: Points

CA Compiler Construction

CIS 371 Computer Organization and Design

Transposition Mechanism for Sparse Matrices on Vector Processors

Lecture 9: Control Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation

Hardware Acceleration of DNNs

Review. Combined Datapath

A Simple Architectural Enhancement for Fast and Flexible Elliptic Curve Cryptography over Binary Finite Fields GF(2 m )

Part I: Definitions and Properties

Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks

Energy Delay Optimization

Numbers and Arithmetic

Numbers and Arithmetic

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

CSD Univ. of Crete Fall 2017 TUTORIAL ON EXTERNAL SORTING

EC 413 Computer Organization

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT

Designing Single-Cycle MIPS Processor

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

Transcription:

EE 660: Computer Architecture Out-of-Order Processors Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa Based on the slides of Prof. David entzlaff

Agenda I4 Processors I2O2 Processors I2OI Processors IO3 Processors IO2I Processors 2

Agenda I4 Processors I2O2 Processors I2OI Processors IO3 Processors IO2I Processors 3

Out-Of-Order (OOO) Introduction Name Frontend Issue riteback Commit I4 IO IO IO IO Fixed Length Pipelines Scoreboard I2O2 IO IO OOO OOO Scoreboard I2OI IO IO OOO IO Scoreboard, Reorder Buffer, and Store Buffer I03 IO OOO OOO OOO Scoreboard and Issue Queue IO2I IO OOO OOO IO Scoreboard, Issue Queue, Reorder Buffer, and Store Buffer 4

OOO Motivating Code Sequence 0 2 1 4 3 5 6 Two independent sequences of instructions enable flexibility in terms of how instructions are scheduled in total order e can schedule statically in software or dynamically in hardware 5

I4: In-Order Front-End, Issue, riteback, Commit F D X M 6

I4: In-Order Front-End, Issue, riteback, Commit X0 X1 F D M0 M1 7

I4: In-Order Front-End, Issue, riteback, Commit (4-stage MUL) X0 X1 X2 X3 F D X2 X3 M0 M1 Y0 Y1 Y2 Y3 To avoid increasing CPI, needs full bypassing which can be expensive. To help cycle time, add Issue stage where register file read and instruction issued to Functional Un 18 it

I4: In-Order Front-End, Issue, riteback, Commit (4-stage MUL) SB X0 X1 X2 X3 ARF F D I X2 X3 M0 M1 Y0 Y1 Y2 Y3 ARF R SB R/ 19

Basic Scoreboard R1 R2 R3 R31 Data Avail. P F 4 3 2 1 0 P: Pending, rite to Destination in flight F: hich functional unit is writing register Data Avail.: here is the write data in the functional unit pipeline A One in Data Avail. In column I means that result data is in stage I of functional unit F Can use F and Data Avail. fields to determine when to bypass and where to bypass from A one in column zero means that cycle functional unit is in the riteback stage Bits in Data Avail. field shift right every cycle. 10

Basic Scoreboard Data Avail. P F 4 3 2 1 0 R1 1 R2 R3 R31 P: Pending, rite to Destination in flight F: hich functional unit is writing register Data Avail.: here is the write data in the functional unit pipeline A One in Data Avail. In column I means that result data is in stage I of functional unit F Can use F and Data Avail. fields to determine when to bypass and where to bypass from A one in column zero means that cycle functional unit is in the riteback stage Bits in Data Avail. field shift right every cycle. 11

0 MUL R1, R2, R3 F 1 ADDIU R11,R10,1 2 MUL R5, R1, R4 3 MUL R7, R5, R6 4 ADDIU R12,R11,1 5 ADDIU R13,R12,1 6 ADDIU R14,R12,2 D I Y0 Y1 Y2 Y3 F D I X0 X1 X2 X3 F D I I I Y0 Y1 Y2 Y3 F D D D I I I I Y0 Y1 Y2 Y3 F F F D D D D I X0 X1 X2 X3 F F F F D I X0 X1 X2 X3 F D I X0 X1 X2 X3 Cyc D I 1 0 2 1 0 3 2 1 4 5 6 3 2 7 8 9 10 4 3 11 5 4 12 6 5 13 6 14 15 16 17 18 4 3 2 1 0 Dest Regs 1 R1 1 1 R11 1 1 1 1 1 1 1 R5 1 1 1 1 1 1 R7 1 1 R12 1 1 1 R13 1 1 1 1 R14 1 1 1 1 1 1 1 1 1 1 RED Indicates if we look at F Field, we can bypass on this cycle 22

Agenda I4 Processors I2O2 Processors I2OI Processors IO3 Processors IO2I Processors 1 3

I2O2: In-order Frontend/Issue, Out-oforder riteback/commit SB X0 ARF F D I M0 M1 Y0 Y1 Y2 Y3 ARF R SB R R/ 14

I2O2 Scoreboard Similar to I4, but we can now use it to track structural hazards on riteback port Set bit in Data Avail. according to length of pipeline Architecture conservatively stalls to avoid A hazards by stalling in Decode therefore current scoreboard sufficient. More complicated scoreboard needed for processing A Hazards 15

0 MUL R1, R2, R3 F 1 ADDIU R11,R10,1 2 MUL R5, R1, R4 3 MUL R7, R5, R6 4 ADDIU R12,R11,1 5 ADDIU R13,R12,1 6 ADDIU R14,R12,2 D I Y0 Y1 Y2 Y3 F D I X0 F D I I I Y0 Y1 Y2 Y3 F D D D I I I I Y0 Y1 Y2 Y3 F F F D D D D I X0 F F F F D I X0 F D I I X0 Cyc D I 1 0 2 1 0 3 2 1 4 5 6 7 8 9 4 3 2 1 0 Dest Regs 1 R1 1 1 R11 1 1 3 2 1 1 1 R5 1 1 10 4 3 1 11 5 4 1 1 R7 12 6 5 1 1 R12 13 1 1 1 R13 14 6 1 1 15 1 1 R15 16 1 17 18 RED Indicates if we look at F Field, we can bypass on this cycle rites with two cycle latency. Structural Hazard 25

Early Commit Point? 0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 / 1 ADDIU R11,R10,1 F D I X0 / 2 MUL R5, R1, R4 F D I I I / 3 MUL R7, R5, R6 F D D D / 4 ADDIU R12,R11,1 F F F / 5 ADDIU R13,R12,1 / 6 ADDIU R14,R12,2 Limits certain types of exceptions. 26

Agenda I4 Processors I2O2 Processors I2OI Processors IO3 Processors IO2I Processors 1 8

F I2OI: In-order Frontend/Issue, Out-oforder riteback, In-order Commit D SB X0 I L0 L1 S0 Y0 Y1 Y2 Y3 PRF ARF ROB C FSB ARF SB PRF ROB FSB R/ R R/ PRF=Physical Register File(Future File), ROB=Reorder Buffer, FSB=Finished Store Buffer (1 entry) R/ R/ 27

State S ST V Preg -- P 1 F 1 P 1 P F P P -- -- Reorder Buffer (ROB) State: {Free, Pending, Finished} S: Speculative ST: Store bit V: Physical Register File Specifier Valid Preg: Physical Register File Specifier 20

State S ST V Preg -- P 1 F 1 P 1 P F P P -- -- State: {Free, Pending, Finished} S: Speculative ST: Store bit Reorder Buffer (ROB) Next instruction allocates here in D Tail of ROB Speculative because branch is in flight Instruction wrote ROB out of order Head of ROB Commit stage is waiting for Head of ROB to be finished V: Physical Register File Specifier Valid Preg: Physical Register File Specifier 21

Finished Store Buffer (FSB) V Op Addr Data -- Only need one entry if we only support one memory instruction inflight at a time. Single Entry FSB makes allocation trivial. If support more than one memory instruction, we need to worry about Load/Store address aliasing. 30

0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 C 1 ADDIU R11,R10,1 F D I X0 r C 2 MUL R5, R1, R4 F D I I I Y0 Y1 Y2 Y3 C 3 MUL R7, R5, R6 F D D D I I I I Y0 Y1 Y2 Y3 C 4 ADDIU R12,R11,1 F F F D D D D I X0 r C 5 ADDIU R13,R12,1 F F F F D I X0 r C 6 ADDIU R14,R12,2 F D I I X0 r C Cyc 0 1 2 3 4 5 6 7 8 9 D I 10 4 3 11 5 4 12 6 5 13 14 15 16 17 18 19 ROB 0 1 2 3 0 1 0 R1 2 1 R11 3 2 R11 R1 R12 6 R12 R13 R13 R5 R5 R14 R14 R7 R7 Empty = free entry in ROB State of ROB at beginning of cycle Pending entry inrob Circle=Finished (Cycle after ) Last cycle before entry is freed from ROB (Cycle in C stage) Entry becomes free and is freed on next cycle 31

hat if First Instruction Causes an Exception? 0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 / 1 ADDIU R11,R10,1 F D I X0 r -- / 2 MUL R5, R1, R4 F D I I I Y0 / 3 MUL R7, R5, R6 F D D D I / 4 ADDIU R12,R11,1 F F F D / F D I... 24

hat About Branches? Option 2 0 BEQZ R1, target F D I X0 C 1 ADDIU R11,R10,1 F D I X0 / 2 ADDIU R5, R1, R4 F D I / 3 ADDIU R7, R5, R6 F D / T ADDIU R12,R11,1 F D I... Squash instructions in ROB when Branch commits Option 1 0 BEQZ R1, target F D I X0 C 1 ADDIU R11,R10,1 F D I - 2 ADDIU R5, R1, R4 F D - 3 ADDIU R7, R5, R6 F - T ADDIU R12,R11,1 F D I... Squash instructions earlier. Has more complexity. ROB needs many ports. Option 3 0 BEQZ R1, target F 1 ADDIU R11,R10,1 2ADDIU R5, R1, R4 3ADDIU R7, R5, R6 T ADDIU R12,R11,1 D I X0 C F D I X0 / F D F I X0 / D I X0 / F D I X0 C ait for speculative instructions to reach the Commit stage and squash in Commit stage 25

hat About Branches? Three possible designs with decreasing complexity based on when to squash speculative instructions and de-allocate ROB entry: 1. As soon as branch resolves 2. hen branch commits 3. hen speculative instructions reach commit Base design only allows one branch at a time. Second branch stalls in decode. Can add more bits to track multiple in-flight branches. 26

Avoiding Stalling Commit on Store Miss PRF ARF ROB C CSB R FSB 0 OpA F D I X0 C CSB=Committed Store Buffer 1 S F D I S0 C C C C 2 OpB F D I X0 C 3 OpC F D I X X X X C 4 OpD F D I I I I X C ith Retire Stage 0 OpA F D I X0 C 1 S F D I S0 C R R R 2 OpB F D I X0 C 3 OpC F D I X C 4 OpD F D I X C 35

Agenda I4 Processors I2O2 Processors I2OI Processors IO3 Processors IO2I Processors 2 8

IO3: In-order Frontend, Out-of-order Issue/riteback/Commit SB X0 ARF F D I Q I M0 M1 Y0 Y1 Y2 Y3 ARF SB I Q R R R/ R/ 36

Issue Queue (IQ) Op Imm S V Dest V P Src0 V P Src1 Op: Opcode Imm.: Immediate S: Speculative Bit V: Valid (Instruction has corresponding Src/Dest) P: Pending (aiting on operands to be produced) Instruction Ready = (!Vsrc0!Psrc0) && (!Vsrc1!Psrc1) && no structural hazards For high performance, factor in bypassing 30

Centralized vs. Distributed Issue Queue X0 I Q A I X0 F D I Q I M0 F D M0 Y0 I Q B I Y0 Centralized Distributed 31

Advanced Scoreboard R1 R2 R3 R31 Data Avail. P 4 3 2 1 0 P: Pending, rite to Destination in flight Data Avail.: here is the write data in the pipeline and which functional unit Data Avail. now contains functional unit identfier A non-empty value in column zero means that cycle functional unit is in the riteback stage Bits in Data Avail. field shift right every cycle. 32

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 1 ADDIU R11,R10,1 F D I X0 2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 4 ADDIU R12,R11,1 F D i I X0 5 ADDIU R13,R12,1 F D i I X0 6 ADDIU R14,R12,2 F D i I X0 Cyc 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 D I 0 1 0 2 1 3 4 5 2 6 4 IQ 0 1 2 R1/R2/R3 R11/R10 R5/R1/R4 R7/R5/R6 R12/R11 R13/R12 5 R14/R12 3 6 R14/R12 Dest/Src0/Src1, Circle denotes value present in ARF Value bypassed so no circle, present bit Value set present by Instruction 1 in cycle 5, Stage 40

Assume All Instruction in Issue Queue 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 MUL R1, R2, R3 F D i I Y0 Y1 Y2 Y3 1 ADDIU R11,R10,1 F D i I X0 2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 4 ADDIU R12,R11,1 F D i I X0 5 ADDIU R13,R12,1 F D i I X0 6 ADDIU R14,R12,2 F D i I X0 Better performance than previous? 41

Agenda I4 Processors I2O2 Processors I2OI Processors IO3 Processors IO2I Processors 3 5

F IO2I: In-order Frontend, Out-of-order Issue/riteback, In-order Commit D I Q SB X0 I L0 L1 S0 Y0 Y1 Y2 Y3 PRF ROB FSB ARF C ARF SB PRF ROB FSB IQ R/ R/ R R/ R/ R/ 42

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 C 1 ADDIU R11,R10,1 F D I X0 r C 2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 C 3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 C 4 ADDIU R12,R11,1 F D i I X0 r C 5 ADDIU R13,R12,1 F D i I X0 r C 6 ADDIU R14,R12,2 F D i I X0 r C 0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 C 1 ADDIU R11,R10,1 F D I X0 r C 2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 C 3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 C 4 ADDIU R12,R11,1 F D i I X0 r C 5 ADDIU R13,R12,1 F D i I X0 r C 6 ADDIU R14,R12,2 F D i I X0 r C 37

Out-of-order 2-ide Superscalar with 1 ALU 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 MUL R1, R2, R3 F D I Y0 Y1 Y2 Y3 C 1 ADDIU R11,R10,1 F D I X0 r C 2 MUL R5, R1, R4 F D i I Y0 Y1 Y2 Y3 C 3 MUL R7, R5, R6 F D i I Y0 Y1 Y2 Y3 C 4 ADDIU R12,R11,1 F D I X0 r C 5 ADDIU R13,R12,1 F D i I X0 r C 6 ADDIU R14,R12,2 F D i I X0 r C 38

Acknowledgements These slides contain material developed and copyright by: Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) David Patterson (UCB) Christopher Batten (Cornell) MIT material derived from course 6.823 UCB material derived from course CS252 & CS152 Cornell material derived from course ECE 4750 39

Copyright 2013 David entzlaff 40