Goals for Performance Lecture


Goals for Performance Lecture
- Understand performance, speedup, throughput, latency
- Relationship between cycle time, cycles per instruction (CPI), and number of instructions (the performance equation)
- Amdahl's Law
- Know how to do the problems in the homework!

Definitions
Performance is in units of things-per-time:
- Miles per hour, bits per second, widgets per day
- Bigger is better
If we are primarily concerned with response time:
- Performance(X) = 1 / ExecutionTime(X)
"X is n times faster than Y" means n = Performance(X) / Performance(Y) = Speedup
If X is 1.yz times faster than Y, we can informally say that X is yz% faster than Y. Bigger speedup is better.

Latency vs. Throughput
Latency (response time):
- How long does it take for my job to run?
- How long does it take to execute a job?
- How long must I wait for the database query?
Throughput:
- How many jobs can the machine run at once?
- What is the average execution rate?
- How much work is getting done?
If we upgrade a machine with a new processor, what do we increase? If we add a new machine to the lab, what do we increase?

Clock Cycles
Instead of reporting execution time in seconds, we often use cycles:
- Clock ticks indicate when to start activities
- Cycle time = time between ticks = seconds per cycle
- Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
seconds/program = (cycles/program) * (seconds/cycle)
A 200 MHz clock has a cycle time of 1 / (200 * 10^6) s = 5 nanoseconds.
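The cycle-time arithmetic above is easy to check with a short sketch (Python here just for illustration; the helper name is ours, not from the slides):

```python
def cycle_time_ns(clock_rate_hz):
    """Cycle time in nanoseconds for a given clock rate in Hz."""
    return 1e9 / clock_rate_hz

# A 200 MHz clock ticks every 5 ns.
print(cycle_time_ns(200e6))  # 5.0
```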

How many cycles in a program?
Could assume that # of cycles = # of instructions...
[Figure: timeline with one clock cycle per instruction, 1st through 6th instruction]
This assumption is incorrect: different instructions take different amounts of time on different machines (even with the same instruction set). Why?

Different numbers of cycles for different instructions
- Multiplication takes more time than addition
- Floating-point operations take longer than integer ones
- Accessing memory takes more time than accessing registers
Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)

Example instruction latencies (Imagine stream processor)
On the ALU:
- Integer add: 2 cycles
- FP add: 4
- Logic ops (and, or, xor): 1
- Equality: 1
- < or >: 2
- Shifts: 1
- Float->int: 3
- Int->float: 4
- Select (a?b:c): 1
Other functional units:
- Integer multiply: 4
- Integer divide: 22
- Integer remainder: 23
- FP multiply: 4
- FP divide: 17
- FP sqrt: 16

CPI
How many clock cycles, on average, does it take for each instruction executed? We call this CPI ("cycles per instruction"). Its inverse (1/CPI) is IPC ("instructions per cycle").
- CISC machines: this number is high(er)
- RISC machines: this number is low(er)

CPI: Average Cycles per Instruction
CPI = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count
CPI = sum over i = 1..n of (CPI_i * F_i), where F_i = I_i / Instruction Count
On Imagine, integer adds are 2 cycles and FP adds are 4. Consider an application that has 1/3 integer adds and 2/3 FP adds. What is its CPI? Given a 3 GHz machine, how many instructions/sec?
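The weighted-sum formula can be checked numerically. A small sketch using the instruction mix from the slide (the helper name is ours):

```python
def avg_cpi(mix):
    """Weighted average CPI from a list of (fraction, cycles) pairs."""
    return sum(f * c for f, c in mix)

# 1/3 integer adds (2 cycles), 2/3 FP adds (4 cycles)
cpi = avg_cpi([(1/3, 2), (2/3, 4)])
clock_rate = 3e9              # 3 GHz
ips = clock_rate / cpi        # instructions per second

print(cpi)  # 10/3 ~= 3.33
print(ips)  # ~9e8 instructions/sec
```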

The Performance Equation
Time = Cycle Time * CPI * Instruction Count
     = (seconds/cycle) * (cycles/instr) * (instrs/program) => seconds/program
Performance = Clock Rate * IPC * (1 / Instruction Count)
     = (cycles/second) * (instrs/cycle) * (programs/instr) => programs/second
Clock Rate * IPC = instrs/second = IPS (often reported as MIPS)
The only reliable measure of computer performance is time.
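The three factors multiply out directly; the units cancel just as in the equation. A sketch with made-up numbers (the instruction count, CPI, and clock rate below are illustrative, not from the slides):

```python
def exec_time(instr_count, cpi, clock_rate_hz):
    """seconds/program = (instrs/program) * (cycles/instr) * (seconds/cycle)."""
    return instr_count * cpi / clock_rate_hz

# e.g. 1e9 instructions at CPI 2.0 on a 500 MHz clock
print(exec_time(1e9, 2.0, 500e6))  # 4.0 seconds
```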

Remember
- Performance is specific to a particular program or set of programs
- Total execution time is a consistent summary of performance
For a given architecture, performance increases come from:
- increases in clock rate (without adverse CPI effects)
- improvements in processor organization that lower CPI
- compiler enhancements that lower CPI and/or instruction count
Pitfall: expecting an improvement in one aspect of a machine's performance to improve total performance proportionally.
You should not always believe everything you read! Read carefully! (see newspaper articles)

Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected:
ExTime(with E) = ((1 - F) + F/S) * ExTime(without E)
Speedup(with E) = 1 / ((1 - F) + F/S)
Design principle: make the common case fast!
There are many ways to express Amdahl's Law!
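The law is a one-liner in code, which makes it easy to see how quickly the unaccelerated fraction dominates (the function name and example numbers are ours):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the task is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up half the task by 10x yields well under 2x overall.
print(amdahl_speedup(0.5, 10))  # ~1.82
```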

Undergrad Productivity
The average CPE student spends:
- 4 hours sleeping
- 2 hours eating
- 18 hours studying
A magic pill (purchased from a dude at the UU) lets you do all your sleeping and eating in 1 minute! What's the speedup on sleeping/eating? How much more productive can you get?

Undergrad Productivity
Speedup(with E) = 1 / ((1 - F) + F/S)
F = accelerated fraction = 0.25 (6 hrs / 24 hrs)
S = speedup = 6 hrs / 1 minute = 360
Overall speedup: 1 / [(1 - 0.25) + (0.25/360)] ~= 1 / (1 - 0.25) ~= 1.33
33% more productive!
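The approximation on the slide can be verified numerically: even with S = 360, the result is essentially the 1/(1 - F) limit, because the 18 hours of studying are untouched. A quick check:

```python
f, s = 6 / 24, 360           # fraction of the day accelerated, speedup factor
exact = 1 / ((1 - f) + f / s)
bound = 1 / (1 - f)          # limit as s -> infinity

print(round(exact, 3), round(bound, 3))  # 1.332 1.333
```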