The Elusive Metric for Low-Power Architecture Research

Hsien-Hsin Sean Lee, Joshua B. Fryman, A. Utku Diril, Yuvraj S. Dhillon
Center for Experimental Research in Computer Systems, Georgia Institute of Technology, Atlanta, GA 30332
Workshop for Complexity-Effective Design, San Diego, CA, 2003

Background Picture
- Energy-Delay product (EDP) [Gonzalez & Horowitz 96]
  - Power is meaningless ($\propto$ frequency)
  - Energy per instruction is elusive ($\propto CV^2$)
  - Energy × Delay (J/SPEC or J/IPC) is better
  - Using the alpha-power model, $ED \propto C V_{dd}^{3} / (V_{dd} - V_{th})^{\alpha}$
  - Note that EDP has no physical meaning
- Widespread adoption
  - De facto standard in the community
  - Metric for energy and complexity effectiveness
- New architectural techniques have arrived
  - New hardware exploiting low-power opportunities
  - Temperature-aware power detectors
  - Voltage & frequency scaling
  - Multi-threshold voltage

Outline of the Talk
- Potential pitfalls
  - Yeah, we all know, it is obvious, but...
  - Which E goes in the ED product?
  - Impact of new hardware (more transistors)
  - Methodology matters in deep submicron processes
- Observations
- Summary

Calculating ED Product
- New architecture solutions save energy at the expense of (insensitive) performance loss
- A number of research results were reported in the following manner:
- Technique X for Data Cache
  - Reduces Data Cache energy by 50%
  - Loses 20% IPC
  - EDP = (1 - 0.5) × (1 + 0.2) = 0.60: very energy efficient
- Technique Y for Branch Predictor
  - Reduces Branch Predictor energy by 10%
  - Loses 20% IPC
  - EDP = (1 - 0.1) × (1 + 0.2) = 1.08: energy inefficient
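The arithmetic on this slide can be reproduced with a short sketch. The helper name `naive_edp` is hypothetical, and it follows the slide's convention of using the 20% IPC loss directly as a 20% delay increase; the point is only that the product is built from the local energy of one structure.

```python
def naive_edp(energy_saved: float, ipc_loss: float) -> float:
    """Relative energy-delay product as the slide computes it.

    energy_saved: fraction of the structure's energy removed (0.5 = 50%).
    ipc_loss: fractional IPC loss, used directly as the delay increase
              (the slide's convention, kept here as an assumption).
    """
    return (1.0 - energy_saved) * (1.0 + ipc_loss)


technique_x = naive_edp(energy_saved=0.50, ipc_loss=0.20)  # data cache
technique_y = naive_edp(energy_saved=0.10, ipc_loss=0.20)  # branch predictor

for name, edp in (("X", technique_x), ("Y", technique_y)):
    # A relative EDP below 1.0 is what the slide labels "energy efficient".
    verdict = "efficient" if edp < 1.0 else "inefficient"
    print(f"Technique {name}: relative EDP = {edp:.2f} ({verdict})")
```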

So What is E and What is D in EDP?
- Hypothetical black box
  - Battery (i.e. E) shared by CPU, DRAM, chipsets, graphics, TFT, Wi-Fi, HDD, flash disk
  - D typically accounts for some system effects, such as DRAM latency
- Improvement proposed: remove 5% of E from the flash disk
  - No delay incurred
  - Is this a good design decision?
- Flash disk is 10% of total E in the system
  - Improvement amounts to 0.5% system impact
  - In-the-noise improvement
  - Is the complexity worth the effort?
- So, is EDP used in the right way? And is EDP so important?
[Figure: block diagram of a battery powering the Gfx card, flash, C.S., 802.11, TFT display, DDR-DRAM, and HDD]
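The flash-disk example is a product of two fractions, which a minimal sketch makes explicit. The function and variable names are hypothetical; the 10% share and the 5% saving are the slide's numbers.

```python
def system_energy_saving(component_share: float, component_saving: float) -> float:
    """Fraction of total system energy saved by improving one component."""
    return component_share * component_saving


flash_share = 0.10   # flash disk is 10% of total system energy
flash_saving = 0.05  # proposed improvement removes 5% of flash energy

saving = system_energy_saving(flash_share, flash_saving)
# No delay is incurred, so the delay factor of the ED product stays at 1.0.
relative_system_edp = (1.0 - saving) * 1.0

print(f"System energy saved: {saving:.1%}")
print(f"Relative system EDP: {relative_system_edp:.3f}")
```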

Energy Efficiency: E versus D
[Figure: maximum delay tolerance (log scale, 0.0001 to 100) versus power distribution of a functional unit (FU) w.r.t. the target system (0 to 1), with one curve each for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, and 5%]

Example: Energy Efficiency: E vs. D
[Figure: maximum delay tolerance (log scale, 0.0001 to 100) versus energy distribution w.r.t. the target system (0 to 1), with one curve each for Esaved = 99%, 90%, 58%, 50%, 30%, 10%, and 5%; the highlighted example point tolerates ~25% performance loss]
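One way to read the delay-tolerance curves, offered as an assumption because the slides do not print the formula: let $p$ be the fraction of system energy drawn by the functional unit, $s$ the fraction of that unit's energy that is saved, and $d$ the fractional slowdown (all three symbols are introduced here for illustration). Requiring that the system-level ED product not get worse gives a break-even slowdown. This reading reproduces the figures quoted later in the talk for the Pentium Pro IFU and for the filter cache.

```latex
\begin{align*}
(1 - p\,s)\,(1 + d) &\le 1 \\
d_{\max} &= \frac{p\,s}{1 - p\,s}
\end{align*}
% Example with the Pentium Pro IFU numbers (p = 0.22, s = 0.40):
% d_max = 0.088 / 0.912 \approx 0.096, i.e. just under 10%.
```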

Using EDP: Pentium Pro
[Figure: maximum delay tolerance (0 to 0.3) versus energy saved for a functional unit u (0 to 1), with one curve per energy share listed below]
- Energy share by unit (data source: [Brooks et al. 00])
  - IFU: 22%
  - IEU: 14%
  - ROB, DCU: 11.1%
  - RS, FPU, Global Clock: 7.9%
  - RAT, MOB: 6.3%
  - BTB: 4.7%
- Assume the CPU accounts for 100% of system energy
- A 40% IFU power reduction can tolerate < 10% performance loss

But CPU is not 100% of a System
[Figure: maximum delay tolerance (0 to 150) plotted against the energy distribution of µ w.r.t. the CPU only (0 to 1) and the energy saving for a functional unit µ (0 to 1), for CPU = 100%, 75%, 50%, and 25%]

Case Study: Filter Cache [Kin et al. 97, 00]
- The Filter Cache design as reported
  - 58% energy savings in L1 caches
  - 21% IPC degradation
- ED product as shown, (1 - 0.58) × (1 + 0.21) << 1, suggests this is a winning design
- Question is: which E?

Filter Cache: E Values
[Figure: maximum delay tolerance (0 to 1.4) versus energy distribution for a functional unit u w.r.t. the CPU only (0 to 1), with curves for CPU = 100%, 70%, 50%, and 25%; markers show the filter cache slowdown of 21%, Esaved = 58% [Kin et al. 00], and the SA-110 filter cache operating point (I$ + D$ = 43%)]
- Use StrongARM 110: 43% of energy is consumed by the caches
  - 27% in the I-cache
  - 16% in the D-cache
- CPU=X% stands for X% of overall power drawn by the CPU
- Delay tolerance
  - 33% : CPU=100%
  - 21% : CPU=70%
  - 14% : CPU=50%
  - 6% : CPU=25%
- Not energy-efficient if CPU < 70%
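The delay-tolerance figures on this slide can be approximated with a small sketch. The break-even formula used here, tolerance = saving / (1 - saving) with saving measured at system level, is an assumption (the slides do not state it), and all function and variable names are hypothetical. It lands close to the quoted 33%, 21%, 14%, and 6%.

```python
def max_delay_tolerance(unit_share_of_cpu: float,
                        cpu_share_of_system: float,
                        energy_saved: float) -> float:
    """Slowdown at which the system-level ED product breaks even.

    Assumes (1 - system_saving) * (1 + slowdown) = 1 at break-even.
    """
    system_saving = unit_share_of_cpu * cpu_share_of_system * energy_saved
    return system_saving / (1.0 - system_saving)


CACHE_SHARE_OF_CPU = 0.43  # StrongARM 110: I-cache 27% + D-cache 16%
ENERGY_SAVED = 0.58        # filter cache saving in the L1 caches
SLOWDOWN = 0.21            # reported IPC degradation

tolerances = {}
for cpu_share in (1.00, 0.70, 0.50, 0.25):
    tol = max_delay_tolerance(CACHE_SHARE_OF_CPU, cpu_share, ENERGY_SAVED)
    tolerances[cpu_share] = tol
    # The design pays off only if the tolerated slowdown covers the real one.
    verdict = "efficient" if tol >= SLOWDOWN else "not efficient"
    print(f"CPU={cpu_share:.0%}: tolerance={tol:.1%} ({verdict})")
```

Under this reading, the 21% slowdown is covered only when the CPU draws roughly 70% or more of system power, which matches the slide's conclusion.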

Rethinking EDP: Switching Activity vs. New Hardware
- Ignore leakage and short-circuit power
- Dynamic switching power is dominant
- The E ratio would be as below (T: transistor count, f: frequency)

$$P_{dyn} = a f C V_{dd}^{2} = a f C_{g,avg} T V_{dd}^{2}$$

$$\frac{P_{dyn}^{ref}}{P_{dyn}^{new}} = \frac{a_{ref}\, f\, T}{a_{new}\,(f + \Delta f)\,(T + \Delta T)}$$

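ED Variables
- The elegant ratio governing E:

$$\frac{a_{ref}}{a_{new}} \ge 1 + \frac{\Delta f}{f} + \frac{\Delta T}{T} + \frac{\Delta f\,\Delta T}{f\,T}$$

- To include the application delay, D:

$$\frac{a_{ref}}{a_{new}} \ge \left(1 + \frac{\Delta f}{f} + \frac{\Delta T}{T}\right)\left(1 + \frac{\Delta D}{D}\right)^{2}$$

- Can be applied to macromodeling to determine the trade-off between transistor count and performance degradation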
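As a rough illustration of the trade-off, the sketch below evaluates the dynamic-power ratio with supply voltage and average gate capacitance held fixed, so that it reduces to a_ref / (a_new (1 + Δf/f)(1 + ΔT/T)). Treating a ratio of exactly 1 as the break-even point is an assumption, and the function names are hypothetical.

```python
def dynamic_power_ratio(a_ref: float, a_new: float,
                        df_over_f: float = 0.0,
                        dT_over_T: float = 0.0) -> float:
    """P_dyn(ref) / P_dyn(new) with V_dd and average gate capacitance fixed."""
    return a_ref / (a_new * (1.0 + df_over_f) * (1.0 + dT_over_T))


def max_transistor_growth(switching_reduction: float,
                          df_over_f: float = 0.0) -> float:
    """Largest dT/T that keeps the power ratio at 1 or above (break-even)."""
    a_ratio = 1.0 / (1.0 - switching_reduction)  # a_ref / a_new
    return a_ratio / (1.0 + df_over_f) - 1.0


for reduction in (0.10, 0.25, 0.30):
    growth = max_transistor_growth(reduction)
    print(f"{reduction:.0%} less switching -> up to {growth:.1%} more transistors")
```

With frequency unchanged, a 30% reduction in switching activity leaves room for about 43% more transistors before dynamic power exceeds the reference.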

Impact of Additional Transistor Count
[Figure, left panel: % impact on D (0 to 50) versus % impact on T (-35 to 45), given frequency unchanged, with curves for 30%, 25%, and 10% switching reduced]
[Figure, right panel: % impact on f (0 to 50) versus % impact on T (0 to 50), given delay unchanged by frequency scaling, with curves for 30%, 25%, and 10% switching reduced]
- Given a new average switching probability for the new architecture
  - LHS: trading transistors for delay, given no frequency scaling
  - RHS: delay recovered by frequency scaling

Role of Leakage Energy
- As the Deep Sub-Micron (DSM) era is upon us...
  - More than 50% of power comes from leakage
  - Source: Intel Corp., Custom Integrated Circuits Conference 2002
- Ignoring leakage could reverse the conclusion
- Early architecture evaluation
  - Leakage cannot be isolated from switching during evaluation
  - Additional HW can be harmful

Evaluate the Leakage when Adding HW in the Early Stage of Architecture Definition
- Example: dual-speed pipeline [Pyreddy and Tyson 01]
- Idea appears to be plausible
  - Identify critical instructions [Tune et al. 01] [Seng et al. 01]
  - Two datapaths: fast and slow
  - Critical instructions go to the fast pipe; the remainder go to the slow pipe
  - Slow pipe consumes less E than fast pipe, e.g. multi-voltage supply, lower frequency
- Let's evaluate and assume:
  - N instructions: x to the slow datapath, (N - x) to the fast datapath
  - How does leakage impact efficiency?
  - What x value is needed to achieve energy efficiency?
[Figure: x% of instructions (non-critical) are routed to the slow datapath; (1 - x)% (critical) are routed to the fast datapath]

Dual Datapath Leakage Impact
[Figure: minimum fraction of instructions to the slow datapath (0 to 0.5) versus static-to-total energy ratio (0 to 1), with curves for r = 0.9, 0.75, 0.60, 0.5, 0.4, and 0.2; markers indicate "Today" and "Soon to be"]
- r is the power ratio of slow vs. fast
- A small r impairs performance
  - Slow path becomes the critical path

Dual Datapath Leakage Impact
[Figure: minimum fraction of instructions to the slow datapath (0 to 0.5) versus static-to-total energy ratio (0 to 1), with curves for r = 0.9, 0.75, 0.60, 0.5, 0.4, and 0.2; markers indicate "Today" and "Soon to be"]
- r is the power ratio of slow vs. fast
- A small r impairs performance
  - Slow path becomes the critical path
- % of non-critical instructions needed for the slow datapath
  - Today: ~17%
  - Soon: ~40%

Energy Savings vs. Number of Instructions on the Slow Path
[Figure, two panels (r = 75% and r = 50%): % energy saved (20 to -60) versus % of instructions to the non-critical datapath (0 to 0.4), with curves for Static-to-Total = 1%, 20%, 33%, 50%, 67%, and 75%]
- X-axis: % of instructions to the non-critical datapath
- Y-axis: % energy saved
- If 30% of instructions are sent to the non-critical datapath
  - Only ~5% energy is saved (savings only on the datapath) in DSM for r = 75%
  - More energy is consumed in DSM for r = 50%
- Does the extra complexity pay off?

Observations
- It is insufficient to examine ED product on a microscale; the entire system must be examined.
- Adding HW complexity for low energy needs to be evaluated thoroughly
- If the target process is not DSM, ED product can be examined via simplified ratio analysis
- For DSM process
  - Leakage must be accounted for in local and system E
  - Additional HW could be an overkill

Summary
- Low-power architecture research:
  - Metric could be elusive
  - Methodology
    - More susceptible to reverse conclusions than performance research, if not meticulously applied
    - 2nd order effect today, 1st order effect tomorrow
  - Complexity can be ineffective in energy reduction
- Purposes of our study
  - Provide analytical models and methodology for early evaluation
  - No intention to invalidate prior results (WCED WDDD)
  - Raise more discussions
  - To get it right in education

That's All Folks!