CMP N 301 Computer Architecture. Appendix C


Outline
- Introduction
- Pipelining Hazards
- Pipelining Implementation
- Exception Handling
- Advanced Issues (dynamic scheduling, out-of-order issue, superscalar, etc.)

Pipelining: Introduction
Pipelining is an implementation technique in which multiple instructions are overlapped in execution (the execution steps of different instructions can run in parallel).
Laundry analogy: you have 4 loads of clothes (A, B, C, D) to wash.
- Steps (stages) required: wash, dry, fold, store clothes into drawers
- Each stage needs 30 minutes
- We can't start the next step until the previous step is finished

Pipelining Example: Laundry
There are 2 approaches to this job:
- Sequential (non-pipelined): wait until the first load is put away before starting the next load
- Pipelined (ASAP): as soon as the washer is empty, put in the next load while the first load goes into the dryer

Pipelining Example: Laundry
Sequential laundry needs 8 hours for 4 loads.
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, one after another]

Pipelining Example: Laundry
Pipelined laundry: start work ASAP. Needs only 3.5 hours for 4 loads!
[Figure: timeline starting at 6 PM; loads A, B, C, D in task order, overlapped in time]
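The 8-hour and 3.5-hour figures follow directly from the slide's numbers (4 loads, 4 stages, 30 minutes per stage). A minimal sketch of that arithmetic; the function names are illustrative and not from the lecture:

```python
def sequential_minutes(loads, stages, stage_minutes):
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    """The first load needs `stages` steps; every later load finishes one step after it."""
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(loads=4, stages=4, stage_minutes=30)
pipe = pipelined_minutes(loads=4, stages=4, stage_minutes=30)
print(seq / 60, pipe / 60)  # 8.0 3.5
```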

CPU Pipelining
Review: 5 stages of a MIPS instruction
1. Fetch the instruction from instruction memory
2. Read registers while decoding the instruction
3. Execute the operation or calculate an address, depending on the instruction type
4. Access an operand from data memory
5. Write the result into a register
We can shorten the clock cycle to fit a single stage. Can you see an advantage of a load/store RISC architecture over CISC?
[Figure: a load instruction occupying Cycles 1 to 5, one stage per cycle: fetch, register read/decode, Exec, Mem, Wr]

CPU Pipelining
Registers should be added between stages to separate them and to stabilize the signals passed from one stage to the next.

Pipelining Example: Observations
- After the pipeline is filled, all stages operate concurrently
- Pipelining doesn't reduce the number of stages
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- To pipeline a task, we need separate resources (tasks operating simultaneously use different resources)

Pipelining: Speedup
Speedup due to pipelining depends on the number of stages in the pipeline.
Ideal maximum speedup = number of stages (pipeline depth)
Why is the ideal speedup never achieved?
- Stages are usually not balanced, and the pipeline rate is limited by the slowest stage. If the dryer needs 45 min, every stage has to take 45 min to accommodate it
- Pipelining overheads (such as inter-stage register delay)
- Time to fill the pipeline and time to drain it
- If one load depends on another, we have to wait (delay/stall for dependencies or hazards)
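The fill-and-drain effect can be illustrated numerically. The sketch below assumes perfectly balanced stages, no register overhead and no stalls, so the only loss comes from filling and draining the pipeline; `ideal_speedup` is a hypothetical helper name.

```python
def ideal_speedup(depth, n_tasks):
    """Speedup of a balanced, stall-free pipeline over unpipelined execution.

    Times are measured in stage-times: unpipelined work costs n_tasks * depth,
    pipelined work costs depth + n_tasks - 1 (fill once, then one task per step).
    """
    unpipelined = n_tasks * depth
    pipelined = depth + n_tasks - 1
    return unpipelined / pipelined


# With 4 stages, the speedup climbs toward 4 but never reaches it.
for n in (1, 4, 100, 10_000):
    print(n, round(ideal_speedup(4, n), 3))
```

For the 4-load laundry case this gives 16/7, the same ratio as 8 hours to 3.5 hours.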

CPU Pipelining: Examples
Example (1): Textbook, page C-10
Example (2): Single-cycle, non-pipelined execution
Total time for 3 instructions: 24 ns (8 ns per instruction)
Program execution order:
    lw $1, 100($0)
    lw $2, 200($0)
    lw $3, 300($0)
[Figure: time axis from 2 to 18 ns; each lw passes through instruction fetch, Reg, ALU, data access, Reg in 8 ns, and the next lw starts only when it finishes]

CPU Pipelining: Example
Single-cycle, pipelined execution
- Improves performance by increasing instruction throughput
- Total time for 3 instructions = 14 ns
- Each additional instruction adds 2 ns to the total execution time
- Stage time is limited by the slowest resource (2 ns)
Assumptions: a register write occurs in the 1st half of the clock cycle; a register read occurs in the 2nd half
Program execution order:
    lw $1, 100($0)
    lw $2, 200($0)
    lw $3, 300($0)
[Figure: time axis from 2 to 14 ns; each lw passes through instruction fetch, Reg, ALU, data access, Reg, with a new lw starting every 2 ns]
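The 24 ns and 14 ns totals can be reproduced with a short calculation. Only the 8 ns per-instruction total and the 2 ns slowest stage come from the slides; the individual stage latencies below are an assumption chosen to be consistent with those two numbers.

```python
# ASSUMED per-stage latencies in ns: fetch, register read, ALU, data access, register write.
# They sum to 8 ns and the slowest is 2 ns, matching the figures quoted on the slides.
STAGE_NS = [2, 1, 2, 2, 1]


def non_pipelined_ns(n_instructions):
    """Each instruction takes the full sum of stage latencies, back to back."""
    return n_instructions * sum(STAGE_NS)


def pipelined_ns(n_instructions):
    """Every stage is stretched to the slowest stage; one instruction completes per cycle after fill."""
    cycle = max(STAGE_NS)
    return (len(STAGE_NS) + n_instructions - 1) * cycle


print(non_pipelined_ns(3), pipelined_ns(3))  # 24 14
```

Note that a single instruction gets slower when pipelined (10 ns instead of 8 ns under these assumptions), while throughput improves.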

Pipelining Hazards
Hazards: situations that prevent an instruction from being executed in its designated clock cycle
Types of hazards:
1. Structural hazards: conflicts over resources
2. Data hazards: an instruction depends on the result of a previous instruction
3. Control hazards: an instruction changes the PC, as branches do
The simplest solution to a hazard is to stall the pipeline (some instructions are allowed to proceed while others are delayed).
Speedup = pipeline depth / (1 + average stall cycles per instruction)
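As a quick illustration of the speedup formula above, the sketch below evaluates it for a 5-stage pipeline. The stall rates are made-up inputs for illustration, not measurements.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup = pipeline depth / (1 + average stall cycles per instruction)."""
    return depth / (1 + stalls_per_instruction)


print(pipeline_speedup(5, 0.0))   # 5.0  (no stalls: ideal speedup equals the depth)
print(pipeline_speedup(5, 0.25))  # 4.0
print(pipeline_speedup(5, 1.0))   # 2.5  (one stall per instruction halves the benefit)
```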

Structural hazards
No two instructions can be processed by the same module at the same time; a structural hazard arises when two instructions need the same module in the same cycle.
Solutions: stall the pipeline (insert bubbles) or duplicate the resource
- Instruction fetch conflicts with data memory access: use a Harvard architecture with two separate caches
- The register file is accessed by the instruction in decode (reading) and the instruction in write-back (writing): ensure that writing is done in the first half of the clock cycle and reading in the second half

Solving Structural Hazards by Stalling
[Figure: pipeline diagram over Cycles 1 to 7 for Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order, with the structural hazard marked]

Solving Structural Hazards by Stalling
[Figure: pipeline diagram over Cycles 1 to 7 for Load, Instr 1, Instr 2, Stall, Instr 3 in instruction order; the stall appears as a row of bubbles]

Data Hazards
Pipelining might change the order in which operands are read and written compared with sequential execution.
[Figure: pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for the following instructions in program order]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or r8,r1,r9
    xor r10,r1,r11

Types of Data Hazards
Read After Write (RAW): instruction J tries to read an operand before instruction I writes it (true data dependency)
    I: add r1,r2,r3
    J: sub r4,r1,r3
Write After Read (WAR): instruction J writes an operand before instruction I reads it (anti-dependency)
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Write After Write (WAW): instruction J writes an operand before instruction I writes it (output dependency)
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
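The three definitions can be checked mechanically by comparing the destination and source registers of an earlier instruction I and a later instruction J. The sketch below is illustrative only: the `Instr` tuple and `dependences` function are hypothetical names, and the function reports register dependences, which become actual hazards only if the pipeline lets the two accesses happen out of order.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def dependences(i, j):
    """Return the dependence types from earlier instruction i to later instruction j."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes
    if j.dest in i.srcs:
        found.add("WAR")  # j writes what i reads
    if i.dest == j.dest:
        found.add("WAW")  # both write the same register
    return found


# The I/J pairs from the slide.
print(dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3"))))  # {'RAW'}
print(dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3"))))  # {'WAR'}
print(dependences(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3"))))  # {'WAW'}
```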

Data Hazards Solutions
Writing in the first half of the cycle and reading in the second half mitigates the effect of data hazards.
- Solution (1): Stalling the pipeline: insert bubbles and stall the instruction until the data is written back
- Solution (2): Data forwarding (bypassing): forward the data to the requesting stage as soon as it is available
- Solution (3): Software solution: the compiler arranges instructions to avoid data dependencies

Solving Data Hazards by Forwarding
[Figure: pipeline diagram over time (clock cycles) for the following instructions in program order]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or r8,r1,r9
    xor r10,r1,r11
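For this sequence, where each later instruction gets r1 from can be worked out from its distance to the add. The sketch below assumes the classic 5-stage pipeline (IF, ID/RF, EX, MEM, WB), an ALU producer whose consumers need the value in EX, and a register file written in the first half of the cycle and read in the second half. The pipeline register names follow the forwarding hardware slide; `forwarding_source` is a hypothetical helper.

```python
def forwarding_source(distance):
    """Where a consumer `distance` instructions after an ALU producer gets its operand."""
    if distance == 1:
        return "EX/MEM"        # result has just left the ALU
    if distance == 2:
        return "MEM/WR"        # result is one pipeline register further along
    return "register file"     # producer has written back (first half) before the read (second half)


sequence = ["add r1,r2,r3", "sub r4,r1,r3", "and r6,r1,r7", "or r8,r1,r9", "xor r10,r1,r11"]
for distance, text in enumerate(sequence[1:], start=1):
    print(f"{text:16} reads r1 from {forwarding_source(distance)}")
```

Under these assumptions only sub and and need a forwarding path; or and xor read the register file normally.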

HW Change for Forwarding
[Figure: datapath with forwarding; labeled elements: NextPC, Registers, Immediate, ID/EX, two muxes, EX/MEM, Data Memory, MEM/WR, mux]

Data Hazard Even with Forwarding
[Figure: pipeline diagram over time (clock cycles) for the following instructions in program order]
    lw r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or r8,r1,r9

Resolving the Load Data Hazard
The pipeline should be stalled until the data becomes available.
[Figure: pipeline diagram over time (clock cycles) for the following instructions in program order, with bubbles inserted after the lw]
    lw r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or r8,r1,r9

Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.
Slow code:
    LW Rb,b
    LW Rc,c
    ADD Ra,Rb,Rc
    SW a,Ra
    LW Re,e
    LW Rf,f
    SUB Rd,Re,Rf
    SW d,Rd
Fast code:
    LW Rb,b
    LW Rc,c
    LW Re,e
    ADD Ra,Rb,Rc
    LW Rf,f
    SW a,Ra
    SUB Rd,Re,Rf
    SW d,Rd
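The difference between the two orderings can be counted. The sketch below assumes a 5-stage pipeline with forwarding, where an instruction that reads a register loaded by the immediately preceding LW costs one stall cycle. The tuple encoding (opcode, destination, sources) and the function name are illustrative, not part of the lecture.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the LW directly before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


# (opcode, destination register, source registers); SW has no destination register.
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(load_use_stalls(SLOW), load_use_stalls(FAST))  # 2 0
```

Both versions execute the same eight instructions; the reordering only moves each load away from its first use.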