Lecture: Pipelining Basics

Similar documents
Lecture 2: Metrics to Evaluate Systems

CMP N 301 Computer Architecture. Appendix C

Lecture 13: Sequential Circuits, FSM

Lecture 13: Sequential Circuits, FSM

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)

CSCI-564 Advanced Computer Architecture

ICS 233 Computer Architecture & Assembly Language

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

CMP 334: Seventh Class

Lecture 3, Performance

Computer Architecture

Verilog HDL:Digital Design and Modeling. Chapter 11. Additional Design Examples. Additional Figures

Pipelining. Traditional Execution. CS 365 Lecture 12 Prof. Yih Huang. add ld beq CS CS 365 2

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

Lecture 3, Performance

Measurement & Performance

Measurement & Performance

Goals for Performance Lecture

EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining

EC 413 Computer Organization

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

Lecture 12: Adders, Sequential Circuits


Processor Design & ALU Design

Adders, subtractors comparators, multipliers and other ALU elements

Lecture 14: State Tables, Diagrams, Latches, and Flip Flop

CPU DESIGN The Single-Cycle Implementation

Simple Instruction-Pipelining. Pipelined Harvard Datapath

ECE290 Fall 2012 Lecture 22. Dr. Zbigniew Kalbarczyk

Performance of Computers. Performance of Computers. Defining Performance. Forecast

Lecture 12: Adders, Sequential Circuits

Adders, subtractors comparators, multipliers and other ALU elements

Logic Design II (17.342) Spring Lecture Outline

CMP 338: Third Class

Performance, Power & Energy

Lecture 13: Sequential Circuits

Simple Instruction-Pipelining. Pipelined Harvard Datapath

COVER SHEET: Problem#: Points

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

CS 700: Quantitative Methods & Experimental Design in Computer Science

Amdahl's Law. Execution time new = ((1 f) + f/s) Execution time. S. Then:

Digital System Clocking: High-Performance and Low-Power Aspects. Vojin G. Oklobdzija, Vladimir M. Stojanovic, Dejan M. Markovic, Nikola M.

Memory Elements I. CS31 Pascal Van Hentenryck. CS031 Lecture 6 Page 1

ALU A functional unit

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

Counters. We ll look at different kinds of counters and discuss how to build them

Project Two RISC Processor Implementation ECE 485

3. (2) What is the difference between fixed and hybrid instructions?

[2] Predicting the direction of a branch is not enough. What else is necessary?

CMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining

Logic and Computer Design Fundamentals. Chapter 8 Sequencing and Control

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT

COE 328 Final Exam 2008

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

Sequential Logic Worksheet

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle

4. (3) What do we mean when we say something is an N-operand machine?

ENEE350 Lecture Notes-Weeks 14 and 15

/ : Computer Architecture and Design

Basic Computer Organization and Design Part 3/3

Formal Verification of Systems-on-Chip

ww.padasalai.net

EE 660: Computer Architecture Out-of-Order Processors

EECS150 - Digital Design Lecture 11 - Shifters & Counters. Register Summary

[2] Predicting the direction of a branch is not enough. What else is necessary?

King Fahd University of Petroleum and Minerals College of Computer Science and Engineering Computer Engineering Department

A Second Datapath Example YH16

Fundamentals of Computer Systems

Digital Logic: Boolean Algebra and Gates. Textbook Chapter 3

EET 310 Flip-Flops 11/17/2011 1

Portland State University ECE 587/687. Branch Prediction

EE141- Spring 2007 Digital Integrated Circuits

Analytical Modeling of Parallel Systems

Formal Verification of Systems-on-Chip Industrial Practices

L07-L09 recap: Fundamental lesson(s)!

CS61C : Machine Structures

Introduction The Nature of High-Performance Computation

Compiling Techniques

Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College

CHAPTER log 2 64 = 6 lines/mux or decoder 9-2.* C = C 8 V = C 8 C * 9-4.* (Errata: Delete 1 after problem number) 9-5.

EECS 579: Logic and Fault Simulation. Simulation

CS61C : Machine Structures

Department of Electrical and Computer Engineering The University of Texas at Austin

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

CPE100: Digital Logic Design I

CprE 281: Digital Logic

Lecture 3 Review on Digital Logic (Part 2)

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

Instruction Set Extensions for Reed-Solomon Encoding and Decoding

CMSC 313 Lecture 25 Registers Memory Organization DRAM

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

ALU, Latches and Flip-Flops

CMPT-150-e1: Introduction to Computer Design Final Exam

Figure 4.9 MARIE s Datapath

UNIVERSITY OF WISCONSIN MADISON

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

High Performance Computing

Implementing the Controller. Harvard-Style Datapath for DLX

Transcription:

Lecture: Pipelining Basics Topics: Performance equations wrap-up, Basic pipelining implementation Video 1: What is pipelining? Video 2: Clocks and latches Video 3: An example 5-stage pipeline Video 4: Loads/Stores and RISC/CISC Turn in HW1 Guest teacher, Manju Shevgoor, on Monday 1

An Alternative Perspective - I Each program is assumed to run for an equal number of cycles, so we re fair to each program The number of instructions executed per cycle is a measure of how well a program is doing on a system The appropriate summary measure is sum of IPCs or AM of IPCs = 1.2 instr + 1.8 instr + 0.5 instr cyc cyc cyc This measure implicitly assumes that 1 instr in prog-a has the same importance as 1 instr in prog-b 2

An Alternative Perspective - II Each program is assumed to run for an equal number of instructions, so we re fair to each program The number of cycles required per instruction is a measure of how well a program is doing on a system The appropriate summary measure is sum of CPIs or AM of CPIs = 0.8 cyc + 0.6 cyc + 2.0 cyc instr instr instr This measure implicitly assumes that 1 instr in prog-a has the same importance as 1 instr in prog-b 3

AM vs. GM GM of IPCs = 1 / GM of CPIs AM of IPCs represents thruput for a workload where each program runs sequentially for 1 cycle each; but high-ipc programs contribute more to the AM GM of IPCs does not represent run-time for any real workload (what does it mean to multiply instructions?); but every program s IPC contributes equally to the final measure 4

Problem 6 My new laptop has a clock speed that is 30% higher than the old laptop. I m running the same binaries on both machines. Their IPCs are listed below. I run the binaries such that each binary gets an equal share of CPU time. What speedup is my new laptop providing? P1 P2 P3 AM GM Old-IPC 1.2 1.6 2.0 1.6 1.57 New-IPC 1.6 1.6 1.6 1.6 1.6 AM of IPCs is the right measure. Could have also used GM. Speedup with AM would be 1.3. 5

Speedup Vs. Percentage Speedup is a ratio = old exec time / new exec time Improvement, Increase, Decrease usually refer to percentage relative to the baseline = (new perf old perf) / old perf A program ran in 100 seconds on my old laptop and in 70 seconds on my new laptop What is the speedup? (1/70) / (1/100) = 1.42 What is the percentage increase in performance? ( 1/70 1/100 ) / (1/100) = 42% What is the reduction in execution time? 30% 6

Building a Car Unpipelined Start and finish a job before moving to the next Jobs Time 7

The Assembly Line Pipelined Break the job into smaller stages A B C A B C Jobs A B C A B C Time 8

Clocks and Latches Stage 1 Stage 2 9

Clocks and Latches Stage 1 L Stage 2 L Clk 10

Some Equations Unpipelined: time to execute one instruction = T + Tovh For an N-stage pipeline, time per stage = T/N + Tovh Total time per instruction = N (T/N + Tovh) = T + N Tovh Clock cycle time = T/N + Tovh Clock speed = 1 / (T/N + Tovh) Ideal speedup = (T + Tovh) / (T/N + Tovh) Cycles to complete one instruction = N Average CPI (cycles per instr) = 1 11

Problem 1 An unpipelined processor takes 5 ns to work on one instruction. It then takes 0.2 ns to latch its results into latches. I was able to convert the circuits into 5 equal sequential pipeline stages. Answer the following, assuming that there are no stalls in the pipeline. What are the cycle times in the two processors? What are the clock speeds? What are the IPCs? How long does it take to finish one instr? What is the speedup from pipelining? 12

Problem 1 An unpipelined processor takes 5 ns to work on one instruction. It then takes 0.2 ns to latch its results into latches. I was able to convert the circuits into 5 equal sequential pipeline stages. Answer the following, assuming that there are no stalls in the pipeline. What are the cycle times in the two processors? 5.2ns and 1.2ns What are the clock speeds? 192 MHz and 833 MHz What are the IPCs? 1 and 1 How long does it take to finish one instr? 5.2ns and 6ns What is the speedup from pipelining? 833/192 = 4.34 13

Problem 2 An unpipelined processor takes 5 ns to work on one instruction. It then takes 0.2 ns to latch its results into latches. I was able to convert the circuits into 5 sequential pipeline stages. The stages have the following lengths: 1ns; 0.6ns; 1.2ns; 1.4ns; 0.8ns. Answer the following, assuming that there are no stalls in the pipeline. What is the cycle time in the new processor? What is the clock speed? What is the IPC? How long does it take to finish one instr? What is the speedup from pipelining? What is the max speedup from pipelining? 14

Problem 2 An unpipelined processor takes 5 ns to work on one instruction. It then takes 0.2 ns to latch its results into latches. I was able to convert the circuits into 5 sequential pipeline stages. The stages have the following lengths: 1ns; 0.6ns; 1.2ns; 1.4ns; 0.8ns. Answer the following, assuming that there are no stalls in the pipeline. What is the cycle time in the new processor? 1.6ns What is the clock speed? 625 MHz What is the IPC? 1 How long does it take to finish one instr? 8ns What is the speedup from pipelining? 625/192 = 3.26 What is the max speedup from pipelining? 5.2/0.2 = 26 15

A 5-Stage Pipeline Source: H&P textbook 16

A 5-Stage Pipeline Use the PC to access the I-cache and increment PC by 4 17

A 5-Stage Pipeline Read registers, compare registers, compute branch target; for now, assume branches take 2 cyc (there is enough work that branches can easily take more) 18

A 5-Stage Pipeline ALU computation, effective address computation for load/store 19

A 5-Stage Pipeline Memory access to/from data cache, stores finish in 4 cycles 20

A 5-Stage Pipeline Write result of ALU computation or load into register file 21

RISC/CISC Loads/Stores 22

Problem 3 For the following code sequence, show how the instrs flow through the pipeline: ADD R1, R2, R3 BEZ R4, [R5] LD [R6] R7 ST [R8] R9 23

Pipeline Summary RR ALU DM RW ADD R1, R2, R3 Rd R1,R2 R1+R2 -- Wr R3 BEZ R1, [R5] Rd R1, R5 -- -- -- Compare, Set PC LD 8[R3] R6 Rd R3 R3+8 Get data Wr R6 ST 8[R3] R6 Rd R3,R6 R3+8 Wr data -- 24

Problem 4 Convert this C code into equivalent RISC assembly instructions a[i] = b[i] + c[i]; 25

Problem 4 Convert this C code into equivalent RISC assembly instructions a[i] = b[i] + c[i]; LD [R1], R2 # R1 has the address for variable i MUL R2, 8, R3 # the offset from the start of the array ADD R4, R3, R7 # R4 has the address of a[0] ADD R5, R3, R8 # R5 has the address of b[0] ADD R6, R3, R9 # R6 has the address of c[0] LD [R8], R10 # Bringing b[i] LD [R9], R11 # Bringing c[i] ADD R10, R11, R12 # Sum is in R12 ST [R7], R12 # Putting result in a[i] 26

Problem 5 Design your own hypothetical 8-stage pipeline. 27

Title Bullet 28