Lecture 5: Performance and Efficiency. James C. Hoe, Department of ECE, Carnegie Mellon University

18-643 Lecture 5: Performance and Efficiency. James C. Hoe, Department of ECE, Carnegie Mellon University

Housekeeping
Your goal today: appreciate the many subtleties and dimensionalities of performance (digested from three 18-447 lectures).
Notices:
- Handout #3: Lab 1, due noon, 9/22
- ZedBoard ready for pick-up (see Handout #3a)
- Recitation starts this week, Wed 4:30~5:20
Readings (skim for background):
- Kung, "Memory requirements for balanced computer architectures," ISCA 1986.
- Shao, et al., "Aladdin: ...," ISCA 2014.

The ZedBoard Kit (ECE is loaning it to you)
- 1x Avnet ZedBoard 7020 baseboard
- 1x 12 V power supply
- 1x 4 GB SD card (don't need it right away)
- 2x micro-USB cable
You will treat it like your grade depended on it. You will return all of it in perfect condition, or else.

Looking Ahead
- Lab 1 (wk 3/4): get cozy with Vivado; most important: learn the logic analyzer and Eclipse debugger
- Lab 2 (wk 5/6): meet Vivado HLS; most important: decide if you would use it
- Lab 3 (wk 7/8): hands-on with the AFU; most important: have confidence it can work
- Project...

Performance is about time
To the first order, performance ∝ 1/time.
Two very different kinds of performance!
- latency = time between start and finish of a task
- throughput = number of tasks finished in a given unit of time (a rate measure)
Either way, the shorter the time, the higher the performance, but...

Throughput ≠ 1/Latency
- If it takes T sec to do N tasks, throughput = N/T; latency = T/N?
- If it takes t sec to do 1 task, latency = t; throughput = 1/t?
- When there is concurrency, throughput > 1/latency.
[Figure: N tasks of length t each, overlapped so that all complete within a total time T < N×t]
Optimizations can trade off one for the other (think bus vs. F1 race car).
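A quick numeric illustration (my numbers, not from the slides): suppose each task has latency t = 4 s, but four tasks can be run concurrently and all finish within T = 7 s. Latency is still 4 s, so 1/latency = 0.25 tasks/s, yet throughput = 4 tasks / 7 s ≈ 0.57 tasks/s, more than twice 1/latency. Concurrency raises throughput without shortening any individual task.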

Throughput ≠ Throughput
Throughput becomes a function of N when there is a non-recurring start-up cost (aka overhead).
E.g., DMA transfer on a bus:
- bus throughput_raw = 1 byte / (10^-9 sec)
- 10^-6 sec to set up a DMA
- throughput_effective to send 1 B, 1 KB, 1 MB, 1 GB?
For start-up time = t_s and throughput_raw = 1/t_1:
- throughput_effective = N / (t_s + N×t_1)
- if t_s >> N×t_1, throughput_effective ≈ N/t_s
- if t_s << N×t_1, throughput_effective ≈ 1/t_1; we say t_s is amortized in the latter case
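A minimal C sketch (not from the slides) that plugs the DMA numbers above into the throughput_effective formula; the helper name effective_throughput and the constants are illustrative assumptions.

    #include <stdio.h>

    /* throughput_effective = N / (t_s + N * t_1) */
    static double effective_throughput(double n_bytes, double t_s, double t_1) {
        return n_bytes / (t_s + n_bytes * t_1);
    }

    int main(void) {
        const double t_s = 1e-6;   /* 1 us DMA setup cost (slide's assumed overhead) */
        const double t_1 = 1e-9;   /* 1 ns per byte, i.e., 1 GB/s raw bus throughput */
        const double sizes[] = { 1, 1e3, 1e6, 1e9 };  /* 1 B, 1 KB, 1 MB, 1 GB */
        for (int i = 0; i < 4; i++)
            printf("N = %.0e bytes -> %.3e bytes/sec\n",
                   sizes[i], effective_throughput(sizes[i], t_s, t_1));
        return 0;  /* 1 B: ~1e6 B/s; 1 GB: ~1e9 B/s, because t_s is amortized */
    }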

Latency ≠ Latency
What are you doing during the latency period?
Latency = hands-on time + hands-off time
In the DMA example:
- the CPU is busy for the t_s to set up the DMA
- the CPU has to wait N×t_1 for the DMA to complete
- the CPU could be doing something else during N×t_1 to hide that latency
[Figure: timeline of the CPU doing t_s of setup then other work, while the bus spends N×t_1 on each DMA transfer]

Relative Performance
Pop quiz: if X is 50% slower than Y and latency_X = 1.0 s, what is latency_Y?
- Case 1: L_Y = 0.5 s, since L_Y/L_X = 0.5
- Case 2: L_Y = 0.66666 s, since L_X/L_Y = 1.5
The English language is imprecise.

Fixing the language, à la H&P
"X is n times faster than Y" means
  n = performance_X / performance_Y = throughput_X / throughput_Y = latency_Y / latency_X = speedup from Y to X
"X is m% faster than Y" means
  1 + m/100 = performance_X / performance_Y
Delete "slower" from your dictionary:
- for Case 1, say "Y is 100% faster than X in latency"
- for Case 2, say "Y is 50% faster than X in latency"

Faster != Faster
Given two designs X and Y:
- X may be m% faster than Y on input A
- X may be n% (where m != n) faster than Y on input B
- Y may be k% faster than X on input C
Which is faster and by how much?
- depends on which input(s) you care about
- if multiple, also depends on their relative importance
There are many ways to summarize performance into a scalar metric to simplify comparison; more wrong ways than right. When in doubt, present the complete story (go read the H&P chapter on performance).

Multi-Dimensional Optimizations
HW design has many optimization dimensions:
- utilization: by area, by resource type
- performance and latency
- power and energy
- complexity, risk, social factors...
Cannot optimize individual metrics without considering the tradeoffs between them, e.g.:
- reasonable to spend more power for performance
- converse also true (lower perf. for less power)
- but never more power for lower performance

Pareto Optimality (2D example)
[Figure: scatter of design points in the power vs. 1/perf plane; the Pareto front is the lower-left frontier]
All points on the front are optimal (can't do better). How to select between them?
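A minimal C sketch (my illustration, not from the slides) of extracting a 2D Pareto front from (power, 1/perf) design points; the type and point values are made up.

    #include <stdio.h>

    typedef struct { double power; double inv_perf; } design_t;

    /* Does b dominate a? (b is no worse in both dimensions, strictly better in one.) */
    static int dominated(design_t a, design_t b) {
        return (b.power <= a.power && b.inv_perf <= a.inv_perf) &&
               (b.power <  a.power || b.inv_perf <  a.inv_perf);
    }

    int main(void) {
        design_t d[] = { {1.0, 4.0}, {2.0, 2.0}, {2.5, 2.5}, {4.0, 1.0} };
        int n = sizeof d / sizeof d[0];
        for (int i = 0; i < n; i++) {
            int on_front = 1;
            for (int j = 0; j < n; j++)
                if (j != i && dominated(d[i], d[j])) { on_front = 0; break; }
            if (on_front)
                printf("Pareto-optimal: power=%.1f, 1/perf=%.1f\n",
                       d[i].power, d[i].inv_perf);
        }
        return 0;  /* {2.5, 2.5} is dominated by {2.0, 2.0}; the other three are on the front */
    }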

Application-Defined Composite Metrics
Define a scalar function to reflect desiderata, incorporating the dimensions and their relationships.
E.g., the energy-delay (cost) product:
- smaller the better
- can't cheat by minimizing one while ignoring the others
- does it have to have a physical meaning?
Floors and ceilings:
- real-life designs are more often about "good enough" than optimal
- e.g., meet a perf. floor under a power (cost) ceiling (minimize design time, i.e., stop when you get there)

Power != Energy (do not confuse them)

Power = Energy / time
Energy (Joule) is dissipated as heat when charge flows from VDD to GND:
- it takes a certain amount of energy per operation, e.g., an addition, a reg read/write, (dis)charging a node
- to the first order, energy ∝ work
Power (Watt = Joule/s) is the rate of energy dissipation:
- more ops/sec means more Joules/sec
- to the first order, power ∝ performance
It is all very easy if performance ∝ frequency: Power = (½CV²)·f
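A small C sketch of the switching-power formula above; the capacitance, voltage, and frequency values are purely illustrative assumptions (activity factor taken as 1).

    #include <stdio.h>

    /* Dynamic switching power to first order: P = (1/2) * C * V^2 * f */
    static double dynamic_power(double cap_farads, double vdd_volts, double freq_hz) {
        return 0.5 * cap_farads * vdd_volts * vdd_volts * freq_hz;
    }

    int main(void) {
        double cap = 1e-9;   /* hypothetical 1 nF of switched capacitance */
        printf("P = %.3f W at 1.0 V, 1 GHz\n",   dynamic_power(cap, 1.0, 1e9));  /* 0.50 W */
        printf("P = %.3f W at 0.8 V, 500 MHz\n", dynamic_power(cap, 0.8, 5e8));  /* 0.16 W */
        return 0;  /* lowering both V and f cuts power superlinearly */
    }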

Power and Performance Are Not Separable
Easy to minimize power if you don't care about performance.
Expect a superlinear increase in power to increase performance:
- a slower design is simpler
- a lower frequency needs a lower voltage
Corollary: lower perf also uses lower J/op (= slope from the origin).
All in all, slower is more energy/power efficient.
[Figure: Power (Joule/sec) vs. Perf (op/sec); Power ∝ Perf^k with k > 1]

Slower Could Use More Energy (if done inefficiently)
Devices leak charge even when doing no ops: the so-called leakage current (I_leakage).
power_total = switching power + static power
            = (J/op)@perf × (#op / runtime) + (VDD × I_leakage)
energy_total = (J/op)@perf × #op + (VDD × I_leakage) × runtime
Slower reduces power but could increase energy.
[Figure: energy for task (J) vs. perf (op/sec)]
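A rough C sketch with made-up constants, only to show how the energy_total formula above can grow at low performance when the leakage term is held fixed (in reality J/op also changes with voltage, which this sketch deliberately ignores).

    #include <stdio.h>

    /* energy_total = (J/op) * #op + VDD * I_leakage * runtime, runtime = #op / perf */
    int main(void) {
        const double ops      = 1e9;    /* 1e9 operations of work          */
        const double j_per_op = 1e-9;   /* 1 nJ per op, held constant here */
        const double vdd      = 1.0;    /* volts (illustrative)            */
        const double i_leak   = 0.1;    /* amps of leakage (illustrative)  */
        double perfs[] = { 1e8, 1e9, 1e10 };   /* op/sec */
        for (int i = 0; i < 3; i++) {
            double runtime = ops / perfs[i];
            double energy  = j_per_op * ops + vdd * i_leak * runtime;
            printf("perf = %.0e op/s -> runtime = %.2f s, energy = %.2f J\n",
                   perfs[i], runtime, energy);
        }
        /* Running 10x slower stretches runtime, so the fixed leakage power
           integrates over more time and total energy for the task goes up. */
        return 0;
    }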

Perf/Watt != Perf/Watt
Perf/Watt is a normalized measure: it hides the scale of the problem and the platform.
- recall, Watt ∝ perf^k for some k > 1
- 10 GFLOPS/Watt at 1 W is a very different design problem than at 1 kW, 1 MW, or 1 GW
Say you get 10 GFLOPS/Watt on a <GPGPU, problem>; now take 1000 GPGPUs to the same problem:
- realized perf is < 1000x (less than perfect parallelism)
- required power is > 1000x (energy to move data & heat)
In general, be careful with normalized metrics.

Parallelism and Performance

Parallelism Defined
- T_1 (work, measured in time): time to do the work with 1 PE
- T_∞ (critical path): time to do the work with infinite PEs; T_∞ is bounded by dataflow dependence
- Average parallelism: P_avg = T_1 / T_∞
For a system with p PEs: T_p ≥ max{ T_1/p, T_∞ }
When P_avg >> p: T_p ≈ T_1/p, aka linear speedup
Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
[Figure: dataflow graph of the example, with +, *2, -, +, * nodes feeding from a and b to z]
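Working the slide's example with each arithmetic op counted as one time unit (my assumption): T_1 = 5 (the +, *2, -, + and final *). The critical path is a, b -> x = a+b -> (x - y) or (x + y) -> z, so T_∞ = 3 and P_avg = T_1/T_∞ = 5/3 ≈ 1.7. With p = 2 PEs, T_p ≥ max{ 5/2, 3 } = 3: adding more PEs cannot push the speedup past 5/3 because the dataflow dependence dominates.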

Linear Parallel Speedup
Ideally, parallel speedup is linear with p:
speedup = runtime_sequential / runtime_parallel
[Figure: runtime falls as 1/p and speedup grows linearly for p = 1, 2, 3, 4, 5, ...]

Linear Speedup != Linear Speedup
[Figure: speedup S grows linearly with p, yet reaches only about S = 4 by p = 64]
How could this be?

It could be worse......
[Figure: speedup S flattens out as p grows past the available parallelism; limited scalability when P_avg < p]

Parallelization Overhead
The best parallel and sequential algorithms need not be the same:
- the best parallel algorithm is often worse at p = 1
- if runtime_parallel@p=1 = K × runtime_sequential, then best-case speedup = p/K
Communication between PEs is not instantaneous:
- extra time for the act of sending or receiving data, as if adding more work (T_1)
- extra time waiting for data to travel between PEs, as if adding critical path (T_∞)
If overhead grows with p, speedup can even fall.

Parallelization Is Not Just About Speedup
For a given functionality, there is a non-linear tradeoff between power and performance:
- a slower design is simpler
- a lower frequency needs a lower voltage
For the same throughput, replacing 1 module by 2 that are half as fast reduces total power and energy.
[Figure: on the Power ∝ Perf^k (k > 1) curve, "better to replace 1 of this by 2 of these; or N of these"]
Good hardware designs derive performance from parallelism.

Arithmetic Intensity
An algorithm has a cost in terms of operation count:
- runtime_compute-bound = #operations / FLOPS
An algorithm also has a cost in terms of the number of bytes communicated (ld/st or send/receive):
- runtime_BW-bound = #bytes / BW
Which one dominates depends on
- the ratio of FLOPS and BW of the platform
- the ratio of ops and bytes of the algorithm
Average Arithmetic Intensity (AI): how many ops are performed per byte accessed = #operations / #bytes

Roofline Performance Model [Williams & Patterson, 2006]
runtime ≥ max( #op/FLOPS, #byte/BW ) ≥ #op × max( 1/FLOPS, 1/(AI×BW) )
perf = min( FLOPS, AI×BW )
[Figure: attained performance of a system (op/sec) vs. AI of the application; the roof is flat at peak FLOPS for high AI (perf_compute-bound = FLOPS) and slopes as AI×BW for low AI]
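A minimal C sketch of the roofline formula above (my illustration; the peak numbers are the GTX 1080 figures quoted later in the lecture).

    #include <stdio.h>

    /* Roofline: attainable perf = min(peak FLOPS, AI * peak bandwidth) */
    static double roofline(double peak_flops, double peak_bw_bytes, double ai) {
        double bw_bound = ai * peak_bw_bytes;
        return bw_bound < peak_flops ? bw_bound : peak_flops;
    }

    int main(void) {
        double flops = 8e12, bw = 320e9;   /* ~8 TFLOPS, 320 GB/s */
        printf("AI = 0.25 -> %.1f GFLOPS attainable\n", roofline(flops, bw, 0.25) / 1e9);
        printf("AI = 25   -> %.1f GFLOPS attainable\n", roofline(flops, bw, 25.0) / 1e9);
        return 0;  /* 80 GFLOPS (memory bound) vs. 8000 GFLOPS (compute bound) */
    }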

Simple AI Example: MMM
    for(i=0; i<N; i++)
      for(j=0; j<N; j++)
        for(k=0; k<N; k++)
          C[i][j]+=A[i][k]*B[k][j];
N² data-parallel dot products. Assume N is large s.t. one row/col is too large for on-chip.
Operation count: N³ float mult and N³ float add.
External memory accesses (assume 4-byte floats):
- 2N³ 4-byte reads (of A and B) from DRAM...
- N² 4-byte writes (of C) to DRAM...
Arithmetic Intensity = 2N³ / (4 × 2N³) = 1/4
GTX 1080: 8 TFLOPS vs. 320 GByte/sec
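Plugging these numbers into the roofline model above (my arithmetic, not the slide's): with AI = 1/4 and 320 GB/s, the memory-bound ceiling is 0.25 × 320 GB/s = 80 GFLOP/s, about 1% of the GTX 1080's 8 TFLOPS peak. The ridge point sits at AI = 8 TFLOPS / 320 GB/s = 25 ops/byte, so the naive MMM is deeply memory bound, which motivates the blocked version on the next slide.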

Less Simple AI Example: MMM
    for(i0=0; i0<N; i0+=Nb)
      for(j0=0; j0<N; j0+=Nb)
        for(k0=0; k0<N; k0+=Nb) {
          for(i=i0; i<i0+Nb; i++)
            for(j=j0; j<j0+Nb; j++)
              for(k=k0; k<k0+Nb; k++)
                C[i][j]+=A[i][k]*B[k][j];
        }
need to copy block (not shown)
Imagine an N/Nb x N/Nb MATRIX of Nb x Nb matrices:
- the inner triple is a straightforward matrix-matrix mult
- the outer triple is a MATRIX-MATRIX mult
To improve AI, hold Nb x Nb sub-matrices on-chip for data reuse.

AI of Blocked MMM Kernel (Nb x Nb)
    for(i=i0; i<i0+Nb; i++)
      for(j=j0; j<j0+Nb; j++) {
        t=c[i][j];
        for(k=k0; k<k0+Nb; k++)
          t+=a[i][k]*b[k][j];
        C[i][j]=t;
      }
need to copy block (not shown)
Operation count: Nb³ float mult and Nb³ float add.
When A, B fit in the scratchpad (2 x Nb² x 4 bytes):
- 2Nb³ 4-byte on-chip reads (A, B) (fast)
- 3Nb² 4-byte off-chip DRAM reads of A, B, C (slow)
- Nb² 4-byte off-chip DRAM writebacks of C (slow)
Arithmetic Intensity = 2Nb³ / (4 × 4Nb²) = Nb/8
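A self-contained C sketch that fills in the "copy block (not shown)" step in my own way, using plain arrays a_blk/b_blk to stand in for the on-chip scratchpad; the sizes are assumptions and NB is assumed to divide N evenly.

    #include <stddef.h>

    #define N  1024
    #define NB 64

    static float A[N][N], B[N][N], C[N][N];
    static float a_blk[NB][NB], b_blk[NB][NB];   /* stand-ins for on-chip scratchpad */

    void blocked_mmm(void) {
        for (size_t i0 = 0; i0 < N; i0 += NB)
            for (size_t j0 = 0; j0 < N; j0 += NB)
                for (size_t k0 = 0; k0 < N; k0 += NB) {
                    /* stage the A and B blocks on chip (the omitted copy step) */
                    for (size_t i = 0; i < NB; i++)
                        for (size_t k = 0; k < NB; k++)
                            a_blk[i][k] = A[i0 + i][k0 + k];
                    for (size_t k = 0; k < NB; k++)
                        for (size_t j = 0; j < NB; j++)
                            b_blk[k][j] = B[k0 + k][j0 + j];
                    /* the kernel from the slide, reading only the staged blocks */
                    for (size_t i = 0; i < NB; i++)
                        for (size_t j = 0; j < NB; j++) {
                            float t = C[i0 + i][j0 + j];
                            for (size_t k = 0; k < NB; k++)
                                t += a_blk[i][k] * b_blk[k][j];
                            C[i0 + i][j0 + j] = t;
                        }
                }
    }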

AI and Scaling
[Figure 6.17 from P&H, Computer Organization and Design, copyright 2009 Elsevier. All rights reserved.]
AI is a function of the algorithm and the problem size.
Higher AI means more work per communication, and therefore easier to parallelize and to scale.

Strong vs. Weak Scaling
Strong scaling: T_1 (work) remains constant
- i.e., improve the same workload with more PEs
- hard to maintain linearity as p grows toward P_avg
Weak scaling: T_1' = p × T_1
- i.e., improve a larger workload with more PEs
- S_p = runtime_sequential(p × T_1) / runtime_parallel(p × T_1)
Is this harder or easier? Ans: depends on how T_∞ scales with T_1 and how AI scales with T_1.

Amdahl's Law
If only a fraction f is improved by a factor of s:
[Figure: time_sequential split into (1-f) and f; time_parallelized split into (1-f) and f/s]
time_parallel = time_sequential × ( (1-f) + f/s )
speedup = 1 / ( (1-f) + f/s )
- if f is small, s doesn't matter
- even when f is large, diminishing returns on s; eventually (1-f) dominates
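A tiny C sketch of the speedup formula above; the f = 0.9 value is an illustrative assumption chosen to show the saturation.

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - f) + f/s) */
    static double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        double s_vals[] = { 2, 10, 100, 1e9 };
        for (int i = 0; i < 4; i++)
            printf("f = 0.9, s = %-6g -> speedup = %.2f\n",
                   s_vals[i], amdahl(0.9, s_vals[i]));
        return 0;  /* 1.82, 5.26, 9.17, ~10: the (1-f) term caps the speedup at 10x */
    }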

Parting Thoughts
Need to understand performance to get performance!
HW/FPGA design involves many dimensions (each one nuanced):
- optimizations often involve tradeoffs
- over-simplifying is dangerous and misleading
- must understand application needs
- power and energy are first class
Real-life designs have non-technical requirements.