Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Recall: Goal of this class
- Performance
- Reconfiguration
- Power/Energy

PERFORMANCE EVALUATION

What is good performance?
- Time needed to finish certain task(s): latency
- Number of tasks finished per unit time: throughput

Latency vs Throughput (1)
- Does low latency imply high throughput?
- Does high throughput imply low latency?
- Does high latency imply low throughput?
- Does low throughput imply high latency?

Latency vs Throughput (2)
Computer 1 and Computer 2 must each finish tasks A, B and C.

Computer 1:
- Task A takes 15s, B takes 20s, C takes 50s
- Latency = 15s + 20s + 50s = 85s
- Throughput = 3 / 85s = 0.035 tasks/s

Computer 2:
- Task A takes 20s, B takes 25s, C takes 45s
- Latency = 20s + 25s + 45s = 90s
- Throughput = 3 / 90s = 0.03 tasks/s

Is Computer 1 faster than Computer 2?
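To make the arithmetic concrete, here is a minimal Python sketch (a sketch only, using the task times from the slide above) that computes latency and throughput for purely sequential execution:

```python
# Minimal sketch: latency and throughput when tasks run one after another.
# Task times (seconds) are taken from the example above.
def sequential_stats(task_times):
    latency = sum(task_times)               # time until the last task finishes
    throughput = len(task_times) / latency  # tasks completed per second
    return latency, throughput

computer1 = [15, 20, 50]  # tasks A, B, C
computer2 = [20, 25, 45]

print(sequential_stats(computer1))  # (85, ~0.035 tasks/s)
print(sequential_stats(computer2))  # (90, ~0.033 tasks/s)
```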

Latency vs Throughput (3)
What if Computer 2 can perform 3 tasks at the same time?

Computer 1 (sequential):
- Task A takes 15s, B takes 20s, C takes 50s
- Latency = 15s + 20s + 50s = 85s
- Throughput = 3 / 85s = 0.035 tasks/s

Computer 2 (3 tasks in parallel):
- Task A takes 20s, B takes 25s, C takes 45s
- Latency = 45s
- Throughput = 3 / 45s = 0.067 tasks/s

Is Computer 2 faster than Computer 1?

Latency vs Throughput (4)
What if both Computer 1 and Computer 2 can perform 2 tasks at the same time?

Computer 1 (A:15s, B:20s, C:50s), with C in one lane and A then B in the other:
- Latency = 50s
- Throughput = 3 / 50s = 0.06 tasks/s

Computer 2 (A:20s, B:25s, C:45s), with C in one lane and A then B in the other:
- Latency = 45s
- Throughput = 3 / 45s = 0.067 tasks/s

Which computer is faster?

Latency vs Throughput (5)
Both Computer 1 and Computer 2 can perform 2 tasks at the same time. Define latency as the time to get the first result.

Computer 1 (A:15s, B:20s, C:50s):
- First result = 15s; last result = 50s
- Throughput = 3 / 50s = 0.06 tasks/s

Computer 2 (A:20s, B:25s, C:45s):
- First result = 20s; last result = 45s
- Throughput = 3 / 45s = 0.067 tasks/s

Latency vs Throughput (6)
Both Computer 1 and Computer 2 can perform 2 tasks at the same time. Tasks = ABCABC.

Computer 1 (A:15s, B:20s, C:50s):
- First result = 15s; last result = 85s
- Throughput = 6 / 85s = 0.07 tasks/s

Computer 2 (A:20s, B:25s, C:45s):
- First result = 20s; last result = 90s
- Throughput = 6 / 90s = 0.067 tasks/s

Latency vs Throughput Summary
Latency:
- Time for the first data/response to arrive
- Time for a task to finish
- Indicates the responsiveness of a system

Throughput:
- Sustained rate of task completion
- Matters most when there is a lot of continuous input, especially streaming input
- A long-term efficiency measurement

Latency vs Throughput Summary
- Latency and throughput matter in different scenarios
- The two are closely tied to each other, but there is no simple relationship between them
- Many factors affect latency/throughput: data input/workload, scheduling, etc.

Performance: task completion
The time to complete one task is a good way to measure general-purpose computers.
Time to complete 1 task (latency):

    L = (no. of instrs × CPI) / f_clk

How to improve speed?

    L = (no. of instrs × CPI) / f_clk

- Decrease the number of instructions
- Decrease the cycles per instruction (CPI)
- Increase the clock frequency

Increase clock frequency
- Gives a linear increase in performance
- But heat dissipation has prohibited simple clock-frequency boosts
[Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith]

Improving speed

    L = (no. of instrs × CPI) / f_clk

- The compiler mainly influences the number of instructions
- The (micro) computer architecture mainly influences CPI and f_clk
NOTE: the number of instructions of a program is closely related to its CPI, and CPI changes depending on the application.

Review: CPI vs # of instructions
A program executes the following instruction profile:

Instruction Type | Number | Clock Cycles
Add              | 2000   | 1
Multiply         | 1000   | 5
Division         | 500    | 20
Load             | 1000   | 8
Store            | 500    | 2

With a clock cycle time of 1 ns, how long does the program take to finish? What is the average CPI of the processor?

    L = (2000×1 + 1000×5 + 500×20 + 1000×8 + 500×2) × 1 ns = 26 µs
    Avg. CPI = 26,000 / 5,000 = 5.2
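As a cross-check, the same calculation in a short Python sketch (counts and cycle numbers taken from the table above) reproduces the 26 µs run time and the 5.2 average CPI:

```python
# Instruction mix from the table: type -> (count, cycles per instruction)
mix = {"Add": (2000, 1), "Multiply": (1000, 5), "Division": (500, 20),
       "Load": (1000, 8), "Store": (500, 2)}
cycle_time_ns = 1.0

total_cycles = sum(n * cpi for n, cpi in mix.values())  # 26,000 cycles
total_instrs = sum(n for n, _ in mix.values())          # 5,000 instructions

print("run time:", total_cycles * cycle_time_ns / 1e3, "us")  # 26.0 us
print("average CPI:", total_cycles / total_instrs)            # 5.2
```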

Amdahl's Law
The overall speedup from improving a fraction P of the execution by a factor S is:

    Speedup = 1 / ((1 - P) + P/S)

E.g. if P = 0.2 and S = 5, the overall speedup is 1 / ((1 - 0.2) + 0.2/5) = 1.19.
If the same improvement can be applied to a larger portion, P = 0.9, then the speedup = 1 / ((1 - 0.9) + 0.9/5) = 3.57.

Always optimize for the common case.
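The law turns into a one-line helper; here is a minimal sketch showing that the two cases from the slide evaluate as expected:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the execution is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

print(round(amdahl_speedup(0.2, 5), 2))  # 1.19
print(round(amdahl_speedup(0.9, 5), 2))  # 3.57
```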

Instruction example revisited

Instruction Type | Number | Clock Cycles
Add              | 2000   | 1
Multiply         | 1000   | 5
Division         | 500    | 20
Load             | 1000   | 8
Store            | 500    | 2

If we could reduce the execution time of any one instruction type by 10×, which instruction should we optimize?

Case 1: Optimize Add
    L = (2000×0.1 + 1000×5 + 500×20 + 1000×8 + 500×2) × 1 ns = 24.2 µs  (Speedup = 26/24.2 = 1.07)

Case 2: Optimize Load
    L = (2000×1 + 1000×5 + 500×20 + 1000×0.8 + 500×2) × 1 ns = 18.8 µs  (Speedup = 26/18.8 = 1.38)
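A small extension of the earlier sketch evaluates the same 10× what-if for every instruction type in the table (the slide works through the Add and Load cases):

```python
# Instruction mix from the table: type -> (count, cycles per instruction)
mix = {"Add": (2000, 1), "Multiply": (1000, 5), "Division": (500, 20),
       "Load": (1000, 8), "Store": (500, 2)}
base_cycles = sum(n * cpi for n, cpi in mix.values())  # 26,000 cycles

for target, (n, cpi) in mix.items():
    new_cycles = base_cycles - n * cpi + n * (cpi / 10.0)  # cut this type's CPI by 10x
    print(f"{target:9s} speedup = {base_cycles / new_cycles:.2f}")
# Add 1.07, Multiply 1.21, Division 1.53, Load 1.38, Store 1.04
```

As Amdahl's Law suggests, the biggest gains come from the instruction types that account for the largest share of the total cycles.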

Compiler optimizations
- Decrease the number of instructions
  - E.g. common subexpression elimination
  - E.g. constant propagation
  - (?) use function call instead of macro
- Use less expensive instructions
  - E.g. shift right instead of divide by 2
  - E.g. register reuse to avoid loads/stores
- Many more

Ex: Predicated instructions

Pseudo-code:
    if cond { true_part } else { false_part }
    more_instr

Assembly code (with a branch):
        branch to LF if cond is false
        true_part
        goto LD
    LF: false_part
    LD: more_instr

Predicated code:
    (cond)  true_part
    (!cond) false_part
    more_instr

- Reduces the number of instructions (#instr ↓)
- Reduces branch mispredictions (CPI ↓)
- Improves instruction-cache hit rate (CPI ↓)

Decreasing CPI
Traditional high-performance CPU architectures focus on decreasing CPI:
- Reduce data/branch hazards → CPI close to 1
- Increase IPC (instructions per cycle) through parallel processing → CPI < 1, IPC > 1
  - Implicit (hidden below the ISA): superscalar
  - Explicit (exposed through the ISA): VLIW, vector processors, SIMD

Superscalar Processors (1)
Key idea: issue more than 1 instruction per cycle to make maximum use of the computing resources.
- Relatively simple, in-order instruction dispatch + execution: dispatch N consecutive upcoming instructions each cycle until a data hazard arises
- Sophisticated, out-of-order dispatch + execution: execute N not-necessarily-consecutive instructions per cycle as long as there are available execution units

Tomasulo Architecture
[Diagram: an FP op queue and load/store buffers feed reservation stations (Add1-Add3, Mult1-Mult2) in front of the FP adders and FP multipliers; the FP registers and results are connected to/from memory through the Common Data Bus (CDB). Adapted from EECS252, U.C. Berkeley.]

VLIW
Very Long Instruction Word (VLIW) machines:
- Each instruction is in fact composed of multiple smaller, standard instructions (4 to 8 standard instructions per cycle)
- The compiler looks for instructions from the original program that can be issued in the same cycle and packs them into one mega-instruction
- No dynamic instruction analysis in hardware
[Figure: a simplistic VLIW datapath with one IF stage and register file feeding two parallel EX/memory lanes]

Vector Processors
- A processor that operates on vectors as its basic data type (as compared to a scalar processor)
- Vector instructions, e.g. adding 2 vectors:
    set_vector_len 64
    add vectorR, vectorA, vectorB
- A form of data parallelism
- Reduces the number of instructions
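To illustrate the instruction-count argument, here is a rough sketch with assumed per-iteration instruction counts (the exact numbers depend on the ISA and are not from the slides):

```python
# Rough sketch: dynamic instruction count for C = A + B over 64 elements.
N = 64

# Scalar loop: assume load, load, add, store, index update and branch
# per iteration (6 instructions each -- an assumption for illustration).
scalar_instrs = N * 6

# Vector version: set_vector_len plus a handful of vector instructions
# (vector loads, vector add, vector store), each covering all 64 elements.
vector_instrs = 1 + 4

print(f"scalar: {scalar_instrs} instructions, vector: {vector_instrs}")  # 384 vs 5
```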

SIMD
- Single Instruction, Multiple Data: a class of computer architecture
- Only one instruction stream is presented, which operates on multiple data streams
- Vector processing is a special form of SIMD in which all data are indeed vectors
- E.g. Intel's MMX, SSE, SSE2 extensions
- To implement r1=a1+b1, r2=a2+b2, r3=a3+b3 and r4=a4+b4 in one instruction:
    add r1,a1,b1, r2,a2,b2, r3,a3,b3, r4,a4,b4
- Saves instructions
- May pack four 8-bit adds into a single 32-bit add, reusing the 32-bit hardware adder (with small modifications)
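A minimal sketch of the sub-word idea (not any particular ISA's semantics): four independent 8-bit additions carried out on values packed into one 32-bit word, with each lane masked so carries cannot cross lane boundaries:

```python
def pack4(lanes):
    """Pack four 8-bit values (lane 0 = least significant byte) into a 32-bit int."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(lanes))

def unpack4(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def packed_add8(x, y):
    """Lane-wise 8-bit add (wrap-around) on two packed 32-bit words."""
    result = 0
    for i in range(4):
        lane = ((x >> (8 * i)) + (y >> (8 * i))) & 0xFF  # drop the carry out of each lane
        result |= lane << (8 * i)
    return result

a = pack4([10, 200, 30, 255])
b = pack4([5, 100, 30, 1])
print(unpack4(packed_add8(a, b)))  # [15, 44, 60, 0] -- each lane wraps modulo 256
```

In hardware the same effect is achieved by cutting the carry chain of the 32-bit adder at the lane boundaries, which is the small modification mentioned above.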

Explicit vs Implicit (1)
- The Instruction Set Architecture (ISA) is the contract between the software and the hardware
- The hardware guarantees certain behavior to the software according to the ISA
  - E.g. if an instruction i1 comes before instruction i2, then the effect of i1 will definitely be reflected when i2 is executed
- Without changing the ISA, the hardware must extract all the instruction-level parallelism (ILP) behind the scenes while keeping the promised behavior to the software
  - Very complicated hardware design
- Keeping the ISA maintains binary compatibility
  - Applications compiled to run on an Intel 8086 can still be run on a modern Intel Core i7!
- Good division of labor → easy development; a change in HW won't affect SW
- SW cannot foresee the data-dependent run-time behavior of the program

Explicit vs Implicit (2)
- Exposing the underlying parallel architecture to software allows the software to bear the burden of extracting parallelism from the application → simple hardware
- Software can take a long time to do the best job, because it is a one-off effort
- Any change to the hardware requires major changes to the software tools → no division of labor
- Data-dependent behavior cannot be anticipated at compile time → SW cannot fully exploit all possible parallelization opportunities

Performance Summary
Key to computer performance:

    L = (no. of instrs × CPI) / f_clk

- The clock frequency is determined by the circuit implementation
- The number of instructions and the CPI both depend on the tight interaction between the compiler and the computer micro-architecture
- Implicit parallelism hidden behind the ISA puts the burden on the low-level hardware implementation to extract ILP
- Explicit parallelism exposes the underlying architecture to the compiler and leaves the burden of extracting ILP to software

POWER AND ENERGY

Power and Energy
Power consumption of a circuit is the energy consumed per unit time:
- Measures how much energy is being used/dissipated at any one time
- Affects heat dissipation
- Affects the input power supply
- Slightly affects battery lifetime

Energy consumption is the measure of the absolute amount of energy used to perform a certain operation:
- Affects battery capacity
- Concerns embedded system designers

Both metrics are important for reconfigurable computing (RC) designs; some techniques lower power but not energy.

Power, Energy and Performance
Power consumption depends on the activity factor (amount of circuit switching), the load capacitance (size of the circuit), the voltage swing, the supply voltage and the clock frequency:

    P_total = α·C_L·V_sw·V_dd·f_clk + I_sc·V_dd + I_leakage·V_dd
              (dynamic)               (static)

Energy per operation:

    E_op = P_dyn / f_clk = α·C_L·V_sw·V_dd

Total energy consumption:

    E_total = E_op × no. of operations

Total run time:

    T_total = no. of operations × CPI / f_clk
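To see how the terms combine, here is a minimal sketch that plugs illustrative numbers (assumed values, not taken from the slides) into the formulas above:

```python
# Illustrative (assumed) parameters. Units: F, V, Hz, A.
alpha     = 0.1    # activity factor
C_L       = 1e-9   # switched load capacitance (F)
V_sw      = 1.0    # voltage swing (V)
V_dd      = 1.0    # supply voltage (V)
f_clk     = 1e9    # clock frequency (Hz)
I_sc      = 1e-3   # short-circuit current (A)
I_leakage = 2e-3   # leakage current (A)

P_dyn   = alpha * C_L * V_sw * V_dd * f_clk        # 0.1 W
P_total = P_dyn + I_sc * V_dd + I_leakage * V_dd   # 0.103 W

E_op = P_dyn / f_clk           # = alpha*C_L*V_sw*V_dd = 0.1 nJ per operation
n_ops, CPI = 1e6, 2.0          # assumed workload
E_total = E_op * n_ops         # 0.1 mJ
T_total = n_ops * CPI / f_clk  # 2 ms

print(P_total, E_op, E_total, T_total)
```

Note how f_clk cancels out of E_op: raising the clock frequency alone increases power but, to first order, not the energy per operation.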

Dynamic Power Dissipation
[Figure: a CMOS inverter with supply V_dd and input V_in, whose output V_out drives a load capacitance C_L]

    E_0→1 = C_L·V_dd²          (energy drawn from V_dd during a 0→1 output transition)
    E_R   = (1/2)·C_L·V_dd²    (dissipated in the charging path)
    E_C   = (1/2)·C_L·V_dd²    (stored on C_L)

- Energy is stored from V_dd onto C_L during the 0→1 transition
- That energy is drained from C_L to ground during the 1→0 transition
- In the absence of static/leakage power consumption, the capacitance keeps the energy stored until discharged

Dynamic Power Consumption

    P_dynamic = (energy per transition) × (transition rate)
              = C_L·V_dd²·f_clk·P(transition)
              = α·C_L·V_dd²·f_clk
              = C_eff·V_dd²·f_clk

- Power dissipation depends on the input data statistics
- The more data transitions, the more power is consumed

Switching activities
Both inputs of an AND gate switch randomly, i.e. each input is 0 or 1 with 50% probability in any cycle. What is the probability that Q has a 0→1 transition?

A | B | Q = A & B
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1

    P(Q: 0→1) = P(Q = 0)·P(Q = 1) = (3/4)·(1/4) = 3/16
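A quick enumeration (assuming the two inputs are independent and uniformly random in each cycle) confirms the 3/16 figure:

```python
from itertools import product

def and_gate(a, b):
    return a & b

rising, total = 0, 0
# Enumerate all (previous inputs, current inputs) pairs: 4 x 4 = 16 equally likely cases.
for a0, b0 in product([0, 1], repeat=2):
    for a1, b1 in product([0, 1], repeat=2):
        total += 1
        if and_gate(a0, b0) == 0 and and_gate(a1, b1) == 1:
            rising += 1

print(rising, "/", total)  # 3 / 16
```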

Transistor Leakage
- Transistors are not completely turned off even when they should be
- The main contribution comes from the sub-threshold current, a function of V_th and V_dd
[Figure: two CMOS inverters driving load capacitances C_L; with V_in = V_dd the pull-up transistor should be OFF yet a leakage current I_leak still flows, and likewise with V_in = 0 for the pull-down]

What are the Options?
The same power, energy and run-time relationships as above define the available knobs: the activity factor, the load capacitance (circuit size), the voltage swing, the supply voltage and the clock frequency.

    P_total = α·C_L·V_sw·V_dd·f_clk + I_sc·V_dd + I_leakage·V_dd
    E_op    = P_dyn / f_clk = α·C_L·V_sw·V_dd
    E_total = E_op × no. of operations
    T_total = no. of operations × CPI / f_clk