The Elusive Metric for Low-Power Architecture Research


Hsien-Hsin Sean Lee, Joshua B. Fryman, A. Utku Diril, Yuvraj S. Dhillon
Center for Experimental Research in Computer Systems
Georgia Institute of Technology, Atlanta, GA 30332
Workshop for Complexity-Effective Design, San Diego, CA, 2003

Background Picture
- Energy-Delay product (EDP) [Gonzalez & Horowitz 96]
  - Power is meaningless ($\propto$ frequency)
  - Energy per instruction is elusive ($\propto CV^2$)
  - Energy $\times$ Delay (J/SPEC or J/IPC) is better
  - Use Alpha-power model: $ED \propto \dfrac{C V_{dd}^{3}}{(V_{dd} - V_{th})^{\alpha}}$
  - Note that EDP has no physical meaning
- Widespread adoption
  - De facto standard by community
  - Metric for energy and complexity effectiveness
- New architectural techniques have arrived
  - New hardware exploiting low-power opportunities
  - Temperature-aware power detectors
  - Voltage & Frequency Scaling
  - Multi-threshold voltage

Outline of the Talk
- Potential pitfalls
  - Yeah, we all know, it is obvious, but...
  - Which E goes in the ED product?
  - Impact of new hardware (more transistors)
  - Methodology matters in deep submicron processes
- Observations
- Summary

Calculating ED Product
- New architecture solutions save energy at the expense of an (insensitive) performance loss
- A number of research results were reported in the following manner:
  - Technique X for Data Cache
    - Reduce 50% energy of Data Cache
    - Lose 20% IPC
    - EDP = (1 - 0.5) × (1 + 0.2) = 0.60: very energy efficient
  - Technique Y for Branch Predictor
    - Reduce 10% energy of Branch Predictor
    - Lose 20% IPC
    - EDP = (1 - 0.1) × (1 + 0.2) = 1.08: energy inefficient
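The reporting style above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic shown on the slide; the function name `naive_edp_ratio` is a hypothetical helper, and the formula is the slide's own shortcut (energy saved and IPC lost applied as simple factors), not a full energy model.

```python
def naive_edp_ratio(energy_saved: float, ipc_lost: float) -> float:
    """Relative ED product as reported on the slide: (1 - E saved) x (1 + IPC lost).

    Both arguments are fractions, e.g. 0.5 for 50%. A result below 1.0 is
    read as "energy efficient", above 1.0 as "energy inefficient".
    """
    return (1.0 - energy_saved) * (1.0 + ipc_lost)


technique_x = naive_edp_ratio(0.50, 0.20)  # data cache example
technique_y = naive_edp_ratio(0.10, 0.20)  # branch predictor example

print(f"Technique X: EDP = {technique_x:.2f}")  # Technique X: EDP = 0.60
print(f"Technique Y: EDP = {technique_y:.2f}")  # Technique Y: EDP = 1.08
```

Note that the E in this calculation is the energy of the modified unit only, which is exactly the ambiguity the talk goes on to question.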

So What is E and What is D in EDP?
- Hypothetical black box
  - Battery (i.e., E) shared by CPU, DRAM, chipsets, graphics, TFT, Wi-Fi, HDD, flash disk
  - D typically accounts for some system effect such as DRAM latency
- Improvement proposed:
  - Remove 5% of E from flash disk
  - No delay incurred
- Is this a good design decision?
  - Flash disk is 10% of total E in system
  - Improvement amounts to 0.5% system impact
  - In-the-noise improvement
  - Is the complexity worth the effort?
- So, is EDP used in the right way? And is EDP so important?
[Figure: battery shared by Gfx card, flash, C.S., 802.11, TFT display, DDR-DRAM, HDD]
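The flash-disk arithmetic can be checked with a short Python sketch. The helper name `system_energy_saved` is hypothetical; the only inputs are the two numbers given on the slide (the component's 10% share of system energy and the 5% saving within that component).

```python
def system_energy_saved(component_share: float, component_saving: float) -> float:
    """Fraction of total system energy saved when one component saves
    `component_saving` of its own energy and draws `component_share`
    of the system total. Both arguments are fractions in [0, 1]."""
    return component_share * component_saving


flash_saving = system_energy_saved(component_share=0.10, component_saving=0.05)
system_edp = (1.0 - flash_saving) * (1.0 + 0.0)  # no delay incurred

print(f"System energy saved: {flash_saving:.1%}")  # System energy saved: 0.5%
print(f"System-level EDP ratio: {system_edp:.3f}")  # System-level EDP ratio: 0.995
```

Measured against the flash disk alone, the same change would read as a 5% improvement, which is why the choice of E matters.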

Energy Efficiency: E versus D
[Figure: Maximum Delay Tolerance (log scale, 0.0001 to 100) vs. Power Distribution of a FU w.r.t. target system (0 to 1); curves for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, 5%]

Example: Energy Efficiency: E vs. D
- Tolerate ~25% performance loss
[Figure: Maximum Delay Tolerance (log scale, 0.0001 to 100) vs. Energy Distribution w.r.t. target system (0 to 1); curves for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, 5%]
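One plausible reading of the "maximum delay tolerance" curves is the break-even point at which the system-level ED product equals that of the baseline. The symbols below are assumptions introduced for this sketch, not notation from the slides: $p$ is the fraction of system energy drawn by the functional unit, $s$ is the fraction of that unit's energy saved, and $d$ is the relative increase in delay.

```latex
\begin{align*}
  \frac{(ED)_{new}}{(ED)_{ref}} &= (1 - s\,p)(1 + d) \le 1 \\
  \Rightarrow\quad d_{max} &= \frac{1}{1 - s\,p} - 1 = \frac{s\,p}{1 - s\,p}
\end{align*}
```

For instance, $s = 0.5$ and $p = 0.4$ give $d_{max} = 0.2 / 0.8 = 0.25$. Under the same reading, $s = 0.58$ and $p = 0.43$ give $d_{max} \approx 0.33$, which agrees with the 33% delay tolerance quoted for the filter cache at CPU=100%.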

Using EDP: Pentium Pro
- Data source: [Brooks et al. 00]
- Assume 100% for CPU
- A 40% IFU power reduction can tolerate < 10% performance loss
[Figure: Maximum Delay Tolerance (0 to 0.3) vs. Energy Saved for a functional unit µ (0 to 1); curves for IFU (22%), IEU (14%), ROB, DCU (11.1%), RS, FPU, Global Clock (7.9%), RAT, MOB (6.3%), BTB (4.7%)]
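The "< 10%" figure for the IFU can be reproduced with a small Python sketch, assuming the curves mark the break-even point of the ED product, i.e. (1 - share × saved) × (1 + delay) = 1. The per-unit shares are the percentages listed on the slide, read here as applying to each unit named in a group; the dictionary and function names are hypothetical.

```python
# Share of CPU power per functional unit, as listed on the slide
# (assumption: a grouped percentage applies to each unit in the group).
UNIT_SHARE = {
    "IFU": 0.22,
    "IEU": 0.14,
    "ROB": 0.111,
    "DCU": 0.111,
    "RS": 0.079,
    "FPU": 0.079,
    "Global Clock": 0.079,
    "RAT": 0.063,
    "MOB": 0.063,
    "BTB": 0.047,
}


def max_delay_tolerance(share: float, saved: float) -> float:
    """Largest relative slowdown that keeps the ED product at or below the
    baseline, assuming break-even at (1 - share*saved) * (1 + delay) = 1."""
    cut = share * saved
    return cut / (1.0 - cut)


# Tolerable slowdown per unit for a 40% energy reduction in that unit,
# with the CPU assumed to be 100% of the system.
for unit, share in UNIT_SHARE.items():
    print(f"{unit:<13} {max_delay_tolerance(share, 0.40):.1%}")
```

With these assumptions the IFU entry comes out just under 10%, and every other unit tolerates less.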

But CPU is not 100% of a System
[Figure: Maximum Delay Tolerance (0 to 150) vs. Energy Distribution of µ w.r.t. CPU only (0 to 1) and Energy Saving for a functional unit µ (0 to 1); series for CPU=100%, CPU=75%, CPU=50%, CPU=25%]

Case Study: Filter Cache [Kin et al. 97, 00]
- The Filter Cache design as reported
  - 58% energy savings in L1 caches
  - 21% IPC degradation
- ED product as shown, (1 - 0.58) × (1 + 0.21) << 1, suggests this is a winning design
- The question is: which E?

Filter Cache: E Values
- Use StrongARM 110: 43% of energy consumed by caches
  - 27% in I-cache
  - 16% in D-cache
- CPU=X% stands for X% of overall power drawn by CPU
- Delay tolerance
  - 33%: CPU=100%
  - 21%: CPU=70%
  - 14%: CPU=50%
  - 6%: CPU=25%
- Not energy-efficient if CPU < 70%
[Figure: Maximum Delay Tolerance (0 to 1.4) vs. Energy distribution for a functional unit µ w.r.t. CPU only (0 to 1); Esaved = 58% curves for CPU=100%, 70%, 50%, 25%; markers: FC slowdown 21% [Kin et al. 00], FilterCache SA-110 (I$+D$ = 43%)]
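The delay-tolerance list can be approximated with the following Python sketch. It assumes the tolerance is the break-even slowdown of the system-level ED product, with the caches' 43% share scaled by the CPU's share of system power; the function and variable names are hypothetical.

```python
def max_delay_tolerance(unit_share_of_cpu: float,
                        cpu_share_of_system: float,
                        energy_saved: float) -> float:
    """Break-even slowdown, assuming (1 - system_cut) * (1 + delay) = 1,
    where system_cut is the fraction of total system energy removed."""
    system_cut = unit_share_of_cpu * cpu_share_of_system * energy_saved
    return system_cut / (1.0 - system_cut)


CACHE_SHARE = 0.27 + 0.16   # I-cache + D-cache on the StrongARM 110
ENERGY_SAVED = 0.58         # reported filter cache saving in the L1 caches
SLOWDOWN = 0.21             # reported IPC degradation

for cpu_share in (1.00, 0.70, 0.50, 0.25):
    tolerance = max_delay_tolerance(CACHE_SHARE, cpu_share, ENERGY_SAVED)
    verdict = "efficient" if tolerance >= SLOWDOWN else "not efficient"
    print(f"CPU={cpu_share:.0%}: tolerance {tolerance:.1%} -> {verdict}")
```

Under these assumptions the computed tolerances land close to the slide's 33%, 21%, 14%, and 6%, and the 21% slowdown is only tolerable when the CPU draws roughly 70% or more of system power.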

Rethinking EDP: Switching Activity vs. New Hardware
- Ignore leakage and short-circuit power
- Dynamic switching power is dominant
- The E would be as below (T: transistor count, f: frequency):

$$P_{dyn} = a\, f\, C\, V_{dd}^{2} = a\, f\, C_{g}^{avg}\, T\, V_{dd}^{2}$$

$$\frac{P_{dyn}^{ref}}{P_{dyn}^{new}} = \frac{a_{ref}\, f\, T}{a_{new}\,(f + \Delta f)\,(T + \Delta T)}$$

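ED Variables
- The elegant ratio governing E:

$$\frac{a_{ref}}{a_{new}} \ge 1 + \frac{\Delta f}{f} + \frac{\Delta T}{T} + \frac{\Delta f\,\Delta T}{f\,T}$$

- To include the application delay, D:

$$\frac{a_{ref}}{a_{new}} \ge \left(1 + \frac{\Delta f}{f} + \frac{\Delta T}{T}\right)\left(1 + \frac{\Delta D}{D}\right)^{2}$$

- Can be applied to macromodeling to determine the trade-off between transistor count and performance degradation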
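As a rough illustration of how such a ratio could be used for macromodeling, the Python sketch below solves it for the largest tolerable delay or frequency increase. It assumes the condition reads a_ref/a_new >= (1 + Δf/f + ΔT/T) × (1 + ΔD/D)^2; the direction of the inequality and the squared delay term are a reading of the slide, and the function names are hypothetical.

```python
import math


def max_delay_increase(switching_reduced: float, dT: float = 0.0, df: float = 0.0) -> float:
    """Largest dD/D satisfying the assumed condition
    a_ref/a_new >= (1 + df/f + dT/T) * (1 + dD/D)**2.

    switching_reduced: fractional reduction in switching activity (a_new = (1 - r) * a_ref)
    dT, df: fractional change in transistor count and frequency.
    A negative result means the design is already over budget."""
    activity_ratio = 1.0 / (1.0 - switching_reduced)  # a_ref / a_new
    hardware_factor = 1.0 + df + dT
    return math.sqrt(activity_ratio / hardware_factor) - 1.0


def max_freq_increase(switching_reduced: float, dT: float) -> float:
    """Largest df/f under the same assumed condition when delay is unchanged (dD = 0)."""
    return 1.0 / (1.0 - switching_reduced) - 1.0 - dT


for reduced in (0.30, 0.25, 0.10):
    print(f"{reduced:.0%} switching reduced, 10% more transistors: "
          f"dD/D <= {max_delay_increase(reduced, dT=0.10):.1%}, "
          f"df/f <= {max_freq_increase(reduced, dT=0.10):.1%}")
```

The first function corresponds to trading transistors against delay with frequency unchanged; the second corresponds to recovering the delay through frequency scaling.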

Impact of Additional Transistor Count
- Given a new avg switching probability of the new architecture
- LHS: Trading transistors with delay given no freq. scaling
- RHS: Delay recovered by freq. scaling
[Figure, left panel: % Impact on D (0 to 50) vs. % Impact on T, given freq. unchanged (-35 to 45); right panel: % Impact on f (0 to 50) vs. % Impact on T, given delay unchanged by frequency scaling (0 to 50); curves for 30%, 25%, and 10% switching reduced]

Role of Leakage Energy
- As the Deep Sub-Micron (DSM) era is upon us...
  - More than 50% of power comes from leakage (Source: Intel Corp., Custom Integrated Circuits Conference 2002)
- Ignoring leakage could reverse the conclusion
- Early architecture evaluation
  - Leakage cannot be isolated from switching during evaluation
  - Additional HW can be harmful

Evaluate the Leakage when adding HW in Early Stage of Arch Definition
- Example: Dual-speed pipeline [Pyreddy and Tyson 01]
- Idea appears to be plausible
  - Identify critical instructions [Tune et al. 01] [Seng et al. 01]
  - Two datapaths: fast and slow
  - Critical instructions go to the fast pipe; the remainder go to the slow pipe
  - Slow pipe consumes less E than fast pipe (e.g., multi-voltage supply, lower frequency)
- Let's evaluate and assume:
  - N instructions; x to the slow datapath, (N-x) to the fast datapath
  - How does leakage impact efficiency?
  - What x value is needed to achieve energy efficiency?
[Figure: x% of instructions (non-critical) to slow; (1-x)% of instructions (critical) to fast]

Dual Datapath Leakage Impact
- r is the power ratio of slow vs. fast
- A small r impairs performance
  - Slow path becomes critical path
- % of non-critical inst needed for slow datapath
  - Today: ~17%
  - Soon: ~40%
[Figure: Minimum instructions to Slow Datapath (0 to 0.5) vs. Static-to-Total Energy Ratio (0 to 1); curves for r = 0.9, 0.75, 0.60, 0.5, 0.4, 0.2; markers "Today" and "Soon to be"]

Energy Savings v. # Inst of Slow Path
- X-axis: % of instructions to non-critical datapath
- Y-axis: % energy saved
- If 30% of instructions are sent to the non-critical datapath:
  - Only save ~5% energy (savings only on datapath) in DSM for r=75%
  - Consume more energy in DSM for r=50%
- Does the extra complexity pay off?
[Figure: two panels, r = 75% and r = 50%; % energy saved (-60 to 20) vs. fraction of instructions to non-critical datapath (0 to 0.4); curves for Static-to-Total = 1%, 20%, 33%, 50%, 67%, 75%]

Observations
- It is insufficient to examine the ED product on a microscale; the entire system must be examined.
- Adding HW complexity for low energy needs to be evaluated thoroughly
- If the target process is not DSM, the ED product can be examined via simplified ratio analysis
- For a DSM process
  - Leakage must be accounted for in local and system E
  - Additional HW could be overkill

Summary
- Low-power architecture research:
  - Metric could be elusive
  - Methodology
    - More susceptible to reversed conclusions than performance research, if not meticulously applied
    - 2nd-order effect today, 1st-order effect tomorrow
  - Complexity can be ineffective in energy reduction
- Purposes of our study
  - Provide analytical models and methodology for early evaluation
  - No intention to invalidate prior results
  - WCED WDDD
  - Raise more discussions
  - To get it right in education

That's All Folks!