Performance of Computers (CS/ECE 552 Lecture Notes, Chapter 2)


Which computer is fastest? Not so simple; it depends on the workload:
- scientific simulation: FP performance
- program development: integer performance
- commercial work: I/O performance

You want to buy the fastest computer for what you want to do, so the workload is important. Designers want to build the fastest computer for what customers will pay, BUT cost is an important criterion.

Forecast

- Time and performance
- The Iron Law
- MIPS and MFLOPS
- Which programs to run and how to average
- Amdahl's Law

Defining Performance

What is important depends on who you ask:
- Computer system user: minimize elapsed time for a program = time_end - time_start, called response time.
- Computer center manager: maximize completion rate = #jobs/second, called throughput.

Response Time vs. Throughput

Is throughput = 1/(average response time)? Only if there is NO overlap. With overlap, throughput > 1/(average response time).

E.g., a lunch buffet with 5 entrees, where each person spends 2 minutes at every entree:
- throughput is 1 person every 2 minutes,
- BUT the time to fill up a tray is 10 minutes.

Why? Because 5 people (one at each entree) are in service simultaneously. Without that overlap, throughput would be 1/10 person per minute.

What is Performance for Us?

For computer architects, CPU execution time is the time spent running a program. Because people like "faster" to be "bigger", to match intuition:

    performance = 1/X time, where X = response, CPU execution, etc.

Elapsed time = CPU execution time + I/O wait. We will concentrate mostly on CPU execution time.

Improve Performance

Improve (a) response time or (b) throughput?
- A faster CPU improves both (a) and (b).
- Adding more CPUs improves (b), but (a) may also improve due to less queueing.

Performance Comparison

Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n.
Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100.

E.g., A takes 10 s and B takes 15 s on the same program:
15/10 = 1.5, so A is 1.5 times faster than B;
15/10 = 1 + 50/100, so A is 50% faster than B.
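To make the comparison arithmetic concrete, here is a minimal Python sketch of the two formulas above (the function names are mine, not from the notes):

    def times_faster(time_a, time_b):
        """A is n times faster than B iff perf(A)/perf(B) = time(B)/time(A) = n."""
        return time_b / time_a

    def percent_faster(time_a, time_b):
        """A is x% faster than B iff time(B)/time(A) = 1 + x/100."""
        return (time_b / time_a - 1) * 100

    print(times_faster(10, 15))    # 1.5  -> A is 1.5 times faster than B
    print(percent_faster(10, 15))  # 50.0 -> A is 50% faster than B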

Breaking down Performance

A program is broken into instructions; hardware is aware of instructions, not programs. At a lower level, hardware breaks instructions into cycles: the underlying state machines change state every cycle.

E.g., a 500 MHz Pentium III runs 500M cycles/sec, so 1 cycle = 2 ns.
E.g., a 2 GHz "PentiumX" would run 2G cycles/sec, so 1 cycle = 0.5 ns.

Iron Law

    Time/program = instrs/program x cycles/instr x sec/cycle

- sec/cycle (a.k.a. cycle time or clock time) is the heartbeat of the computer, mostly determined by technology and CPU organization.
- cycles/instr (a.k.a. CPI) is mostly determined by the ISA and CPU organization; overlap among instructions makes it smaller.
- instrs/program (a.k.a. instruction count) counts instructions executed, NOT static code, and is mostly determined by the program, compiler, and ISA.

Our Goal

Minimize time, which is the product, NOT the isolated terms. It is a common error to miss terms while devising optimizations: e.g., an ISA change that decreases instruction count BUT leads to a CPU organization that makes the clock slower.

Other Metrics

    MIPS = instruction count/(execution time x 10^6) = clock rate/(CPI x 10^6)

BUT MIPS has problems.
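A one-function Python sketch of the Iron Law (the workload numbers below are hypothetical, chosen only for illustration):

    def execution_time_s(instr_count, cpi, cycle_time_ns):
        """Iron Law: Time/program = instrs/program x cycles/instr x sec/cycle."""
        return instr_count * cpi * cycle_time_ns * 1e-9  # result in seconds

    # Hypothetical workload: 1 billion dynamic instructions, CPI 2.0, 1 ns clock.
    print(execution_time_s(1_000_000_000, 2.0, 1.0))  # 2.0 seconds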

Problems with MIPS

E.g., without FP hardware, an FP op may take 50 single-cycle instructions; with FP hardware, it takes only one 2-cycle instruction. Thus, adding FP hardware:
- CPI increases (why? the FP op goes from 50 cycles/50 instrs to 2 cycles/1 instr),
- but instrs/program decreases by more (why? each FP op shrinks from 50 instructions to 1, a factor of 50),
- so total execution time decreases.

MIPS ignores instrs/program, so MIPS gets worse even though the machine got faster!

MIPS also ignores the program: it is usually used to quote peak performance under ideal conditions, i.e., a "guaranteed not to exceed" number!!

When is MIPS OK? With the same compiler and the same ISA, e.g., the same binary running on a Pentium Pro and a Pentium. Why? Because instrs/program is constant and may be ignored.

Other Metrics

    MFLOPS = FP ops in program/(execution time x 10^6)

This assumes the number of FP ops is independent of compiler and ISA. The assumption is not true: the ISA may not have divide, and optimizing compilers change the count.

Rules

Use ONLY time. Beware when reading, especially if details are omitted. Beware of "peak". Relative MIPS and normalized MFLOPS add to the confusion (see book).
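The FP-hardware paradox is easy to reproduce numerically. A small Python sketch, assuming a hypothetical 1 GHz (1 ns clock) machine and the instruction counts quoted above:

    def mips(instr_count, time_s):
        """MIPS = instruction count / (execution time x 10^6)."""
        return instr_count / (time_s * 1e6)

    # Without FP hardware: one FP op costs 50 single-cycle instructions.
    # With FP hardware: one FP op costs 1 two-cycle instruction.
    time_without_fp = 50 * 1e-9   # 50 instrs x 1 cycle x 1 ns
    time_with_fp = 2 * 1e-9       # 1 instr x 2 cycles x 1 ns
    print(mips(50, time_without_fp))  # 1000.0 MIPS
    print(mips(1, time_with_fp))      # 500.0 MIPS: lower, despite running 25x faster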

Iron Law Example 1

Machine A: clock 1 ns, CPI 2.0 for a program. Machine B: clock 2 ns, CPI 1.2 for the same program. Which is faster, and by how much?

    Time/program = instrs/program x cycles/instr x sec/cycle
    Time(A) = N x 2.0 x 1 = 2N
    Time(B) = N x 1.2 x 2 = 2.4N
    Compare: Time(B)/Time(A) = 2.4N/2N = 1.2

So machine A is 20% faster than machine B for this program.

Iron Law Example 2

Keep the clock of A at 1 ns and the clock of B at 2 ns. For equal performance, if the CPI of B is 1.2, what is the CPI of A?

    Time(B)/Time(A) = 1 = (N x 1.2 x 2)/(N x CPI(A) x 1), so CPI(A) = 2.4

Iron Law Example 3

Keep the CPI of A at 2.0 and the CPI of B at 1.2. For equal performance, if the clock of B is 2 ns, what is the clock of A?

    Time(A)/Time(B) = 1 = (N x 2.0 x clock(A))/(N x 1.2 x 2), so clock(A) = 1.2 ns

Which Programs?

Execution time of what? Best case: you run the same set of programs every day, so port them and time the whole workload. In reality, we use benchmarks:
- programs chosen to measure performance,
- meant to predict the performance of the actual workload (hopefully),
+ they save effort and money,
- but are they representative? honest?
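The three examples can be checked with a few lines of Python (names are mine; the instruction count N cancels out of every ratio):

    def time_per_program(n, cpi, cycle_ns):
        """Iron Law with instruction count n; result in ns."""
        return n * cpi * cycle_ns

    N = 1.0  # cancels in every ratio below
    print(time_per_program(N, 1.2, 2) / time_per_program(N, 2.0, 1))  # 1.2

    cpi_a = (N * 1.2 * 2) / (N * 1)      # Example 2: CPI(A) = 2.4
    clock_a = (N * 1.2 * 2) / (N * 2.0)  # Example 3: clock(A) = 1.2 ns
    print(cpi_a, clock_a)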

How to Average

Example (page 70), execution times:

                Machine A   Machine B
    Program 1         1          10
    Program 2      1000         100
    Total          1001         110

One answer: total execution time; then B is 1001/110 = 9.1 times faster than A.

Another: the arithmetic mean gives the same result. For n programs:

    AM = (1/n) x sum over i of time(i)

AM(A) = 1001/2 = 500.5, AM(B) = 110/2 = 55, and 500.5/55 = 9.1.

This is valid only if the programs run equally often; otherwise use weight factors. Weighted arithmetic mean:

    WAM = sum over i of weight(i) x time(i)

Other Averages: Harmonic Mean

E.g., you drive 30 mph for the first 10 miles and 90 mph for the next 10 miles. What is your average speed?

(30 + 90)/2 = 60 mph is WRONG. Average speed = total distance/total time = 20/(10/30 + 10/90) = 45 mph.

Harmonic mean of n rates:

    HM = n/(sum over i of 1/rate(i))

Use HM if forced to start and end with rates: it is a trick to take the arithmetic mean of times while working with rates rather than times.
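A minimal Python sketch of these averaging rules, reproducing the 9.1 ratio and the 45 mph answer (function names are mine):

    def arithmetic_mean(times):
        return sum(times) / len(times)

    def weighted_arithmetic_mean(weights, times):
        """Weights are relative run frequencies and should sum to 1."""
        return sum(w * t for w, t in zip(weights, times))

    def harmonic_mean(rates):
        return len(rates) / sum(1 / r for r in rates)

    print(arithmetic_mean([1, 1000]) / arithmetic_mean([10, 100]))  # 9.1
    print(harmonic_mean([30, 90]))  # 45.0 mph = total distance / total time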

Dealing with Ratios

Take the same example:

                Machine A   Machine B
    Program 1         1          10
    Program 2      1000         100

If we take ratios with respect to Machine A:

                Machine A   Machine B
    Program 1         1          10
    Program 2         1           0.1

the average for machine A is 1 and the average for machine B is 5.05.

If we take ratios with respect to Machine B:

                Machine A   Machine B
    Program 1         0.1         1
    Program 2        10           1

the average for machine A is 5.05 and the average for machine B is 1. They can't both be true! Don't use the arithmetic mean on ratios (normalized numbers).

Geometric Mean

Use the geometric mean for ratios. For n ratios:

    GM = (product over i of ratio(i))^(1/n)

Use GM if forced to use ratios. GM is independent of the reference machine (a mathematical property): in the example, the GM for machine A is 1 and for machine B is also 1, normalized with respect to either machine.

But...

The geometric mean of ratios is not proportional to total time. The AM in the example says machine B is 9.1 times faster; the GM says the machines are equal. Measured by total execution time, A and B would be equal only if program 1 were run 100 times more often than program 2. Generally, GM will mispredict for three or more machines.
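Both effects, the flip-flop under AM and the reference-independence of GM, can be verified directly. A short Python sketch using the table above:

    import math

    times_a = [1, 1000]
    times_b = [10, 100]

    # AM of ratios flips its verdict with the reference machine:
    print(sum(b / a for a, b in zip(times_a, times_b)) / 2)  # 5.05: B looks slower
    print(sum(a / b for a, b in zip(times_a, times_b)) / 2)  # 5.05: A looks slower!

    # GM of the same ratios is reference-independent:
    print(math.prod(b / a for a, b in zip(times_a, times_b)) ** 0.5)  # 1.0
    print(math.prod(a / b for a, b in zip(times_a, times_b)) ** 0.5)  # 1.0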

Summary of Averaging

- Use AM for times.
- Use HM if forced to use rates.
- Use GM if forced to use ratios.
- Better yet, use unnormalized numbers to compute total time.

Benchmarks: SPEC95

SPEC = System Performance Evaluation Cooperative. The latest suite is SPEC2000, but the text uses SPEC95: 8 integer and 10 floating-point programs. Run times are normalized with respect to a SPARCstation 10/40, and the GM of the normalized times is reported.

SPEC95

    Benchmark   Description
    go          AI, plays go
    m88ksim     Motorola 88K chip simulator
    gcc         Gnu compiler
    compress    Unix utility, compresses files
    li          Lisp interpreter
    ijpeg       Graphic (de)compression
    perl        Unix utility, text processor
    vortex      Database program

Some SPEC95 Programs

    Benchmark   INT/FP    Description
    m88ksim     Integer   Motorola 88K chip simulator
    gcc         Integer   Gnu compiler
    compress    Integer   Unix utility, compresses files
    vortex      Integer   Database program
    su2cor      FP        Quantum physics; Monte Carlo
    hydro2d     FP        Navier-Stokes equations
    mgrid       FP        3-D potential field
    wave5       FP        Electromagnetic particle simulation
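As an illustration of how such a summary number is formed, here is a sketch of a SPEC-style score (the helper name and the benchmark times are hypothetical, not real SPEC95 data):

    import math

    def spec_style_score(machine_times, reference_times):
        """GM of (reference time / machine time) ratios; higher means faster."""
        ratios = [ref / t for ref, t in zip(reference_times, machine_times)]
        return math.prod(ratios) ** (1 / len(ratios))

    # Hypothetical times (s) on three benchmarks vs. a reference machine:
    print(spec_style_score([50, 200, 10], [100, 100, 100]))  # ~2.15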

Amdahl's Law

Why does the common case matter the most?

Speedup = old time/new time = new rate/old rate.

Let an optimization speed up a fraction f of the time by a factor of s. Then:

    new_time = (1-f) x old_time + (f/s) x old_time
    Speedup = old_time/new_time = old_time/((1-f) x old_time + (f/s) x old_time)

    Amdahl's Law: Speedup = 1/((1-f) + f/s)

Amdahl's Law Example

Your boss asks you to improve Pentium Posterior performance by either:
- improving the ALU, used 95% of the time, by 10%, or
- improving the memory pipeline, used 5% of the time, by a factor of 10.

    f     s          Speedup
    95%   1.10       1.094
    5%    10         1.047
    5%    infinity   1.052

Amdahl's Law: Limit

    lim (s -> infinity) 1/((1-f) + f/s) = 1/(1-f)

[Figure: limiting speedup 1/(1-f) plotted against f, for f from 0 to 1 and speedup from 0 to 10; the curve rises sharply only as f approaches 1.]

=> Make the common case fast.
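Amdahl's Law and the example table translate directly into Python (a minimal sketch; the function name is mine):

    def amdahl_speedup(f, s):
        """Amdahl's Law: speedup = 1 / ((1 - f) + f/s)."""
        return 1 / ((1 - f) + f / s)

    print(amdahl_speedup(0.95, 1.10))  # ~1.094: ALU, 95% of time, 10% faster
    print(amdahl_speedup(0.05, 10))    # ~1.047: memory pipeline, 5% of time, 10x
    print(1 / (1 - 0.05))              # ~1.052: the s -> infinity limit, 1/(1-f)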

Summary of Chapter 2

- Time and performance: machine A is n times faster than machine B iff Time(B)/Time(A) = n.
- Iron Law: Time/program = instruction count x CPI x cycle time.
- Other metrics: MIPS and MFLOPS. Beware of peak numbers and omitted details.
- Benchmarks: SPEC95.
- Summarizing performance: AM for times, HM for rates, GM for ratios.
- Amdahl's Law: Speedup = 1/((1-f) + f/s); make the common case fast.