EE M216A.:. Fall 21 Lecture 8 Energy Delay Optimization Prof. Dejan Marković ee216a@gmail.com Some Common Questions Is sizing better than V DD for energy reduction? What are the optimal values of gate size and V DD? What is the optimal ratio of leakage / switching for min E? Shall we increase or decrease V DD for energy reduction? What is the optimal circuit topology? How many levels of parallelism is good? Etc. EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 2 2
Energy Minimization Problem Goal: achieve the lowest energy under the delay constraint E max Energy Unoptimized design W V DD E min W & V DD W & V DD & V TH D min D max (f clk max ) (f clk min ) Delay EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 3 3 Energy Delay Sensitivity Slope of E D curve around a design point (e.g. (A,B )) E/ A S A = D/ A A = A Energy (A S,B ) A f(a,b ) S B D f(a,b) Delay [D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz,.W. Brodersen, JSSC, Aug 4] EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 4 4
Solution: Equal Sensitivities A fixed point is reached when all sensitivities are equal f(a 1,B) E = S A ( D) + S B D Energy (A,B ) D D f(a,b ) f(a,b) Delay EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 5 5 Circuit Optimization Constraints eference design D min sizing @ V DD max, V TH Energy/op topology A topology B Delay Goal: find optimal E D tradeoff for a logic function EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 6 6
Alpha power based Delay Model Combined with Logical Effort Formulation (*) FO4 delay (norm.) 4 3.5 3 2.5 2 1.5 1.5 simulation model V on =.37 V α d = 1.53.5.6.7.8.9 1 V dd / V dd Fitting parameters V on, α d, K d Effective fanout. V DD = 1.2V, FO4 (V DD ) = 25ps (*) [Sutherland et al., Logical Effort, 1999] EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 7 7 Energy Model Switching energy Leakage energy with: D: the cycle time I (S in ): normalized leakage current with inputs in state S in EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 8 8
Adjusting Switching Energy of a Gate V dd, i V dd, i + 1 i i+1 Sizing W i W par,i W out Supply W i p nom,i W i W wire W i+1 = energy stored on the logic gate i EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 9 9 Optimization Setup eference/nominal circuit Sized for D min @ V DD nom Define delay constraint t D con = D min (1 + d inc /1) En nergy erence Minimize energy under delay constraint V DD scaling (global, discrete, per stage) Gate sizing (W) Threshold adjustment (V TH ) Optional buffering D min Delay EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 1 1
Profitability of Optimization ( E/ D) Gate sizing (W) for equal h eff (D min ) Supply voltage (V DD ) S(V DD ) V DD V TH V DD Threshold voltage (V TH ) S(V TH ) V TH V TH EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 11 11 Circuit Optimization Examples Inverter chain Memory decoder Branching Inactive gates Tree adder Long wires e convergent paths Multiple active outputs addr input m = 16 m = 4 n = n = 12 (A 15, B 15 ) predecoder 3 15 C W m = 2 n = 3 word driver m = 1 n = 255 C L C L S 15 word line (A, B ) C in S EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 12 12
Example: Inverter Chain Properties of inverter chain Single path topology Energy increases geometrically from input to output 1 W 1 =1 W 2 W 3 W N C L Goal Find optimal sizing W = [W 1, W 2,, W N ], supply voltage, and buffering strategy to achieve the best energy delay tradeoff EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 13 13 Inverter Chain: Gate Sizing & Buffering effective fanou ut, h eff 25 2 15 1 5 nom opt d inc = 5% 3% 1% 1% % 1 2 3 4 5 6 7 stage [Ma, Franzon, IEEE JSSC, 9/94] Variable taper achieves minimum energy educe number of stages at large d inc EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 14 14
Inverter Chain: V DD Optimization nom 1. 8.6.4 V dd / V dd.8.2 d inc = 5% % nom opt 1 2 3 4 5 6 7 stage effective fanou ut, h eff 18 15 1% 12 1% 3% 9 6 3 nom opt d inc = 5% 3% 1% 1% % 1 2 3 4 5 6 7 stage Variable taper achieved by voltage scaling V DD reduces energy of the final load first EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 15 15 Inverter Chain: Optimization esults energy reduct tion (%) 1 8 6 4 2 cw gvdd 2Vdd cvdd Sensitivity ( norm) 1..8.6.4.2 cw gvdd 2Vdd 1 2 3 4 5 d inc (%) 1 2 3 4 5 d inc (%) Parameter with the largest sensitivity has the largest potential for energy reduction Two discrete supplies mimic per stage V DD EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 16 16
SAM Decoder Energy Profile Energy (nor rm) addr input 1 8 6 4 2 m = 8 Internal energy peak predecoder word driver m = 4 m = 2 m = 1 3 15 C W C L word line EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 17 17 W vs. V DD for educing Energy Peak energy 1 8 6 4 2 1 cw: E (-52%) 2Vdd: E (-25%) E 8 7 d D min,7 +5% inc =1% 6 E 7-19% 4 D min,7 energy 2 1 2 3 4 5 6 7 8 9 stage 1 2 3 4 5 6 7 8 9 stage V DD less effective than W optimization Buffering also reduces energy peak [B. Amrutur, Ph.D. Thesis, Stanford, 8/99] EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 18 18
Tree Adder: Optimization esults eference: all paths are critical erence D min, E Sizing Opt. 1.1D min,.45e 2 V DD Opt. 1.1D min,.73e Internal energy W more effective than V DD For d inc = 1%: E W = 55%, E 2Vdd = 27% EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 19 19 A Look at Tuning Variables 1% excess delay 3 7% energy reduction 1 8 solid: inverter chain dashed: 8/256 decoder dotted: 64-bit adder 1 5 1 4 (V th min ) Energy reduction (%) 6 4 2 W V dd V th Sensitivity 1 3 1 2 1 1 (V dd max ) (V th ) V th V dd W 2 4 6 8 1 Delay increase (%) 1.5.75 1 1.25 1.5 D / D Peak performance is very power inefficient! EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 2 2
Circuit Level esults: Tree Adder E/E 1.5 S W S Vth =.2 S W =22 S Vdd =1.5 1 S Vth =22 S Vdd =16 65%.5 S W =1 S Vth =1 S Vdd =1 5.5 1 D/D 15 1.5 esult of joint (V DD, V TH, W) optimization: 65% of energy saved without delay penalty 25% better delay without energy cost [D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz,.W. Brodersen, JSSC, Aug 4] Limited range of efficient circuit optimization EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 21 21 A Look at Tuning Variables It is best when variables don t reach their boundary values V DD / V DD Supply voltage 1.75 1.5 1.25 1.75.5 25.25 Threshold voltage reliability Sens(V DD ) = 5 limit Sens(W) = 1 Sens(V TH ) = 1 5 25 25 5 75 1 Delay increment (%) V TH (mv) 15 2 25 3 Limited range of tuning variables 35 5 25 25 5 75 1 Delay increment (%) EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 22 22
A Look at Tuning Variables When a variable reaches bound, fewer variables to work with V DD / V DD Supply voltage Threshold voltage 1.75 reliability Sens(V DD ) = 1.5 5 limit Sens(W) = 1.25 1 Sens(V TH ) = 1 1.75.5 25.25 V TH (mv) 15 2 25 Sens(V TH ) = Sens(V( DD ) = 16 3 Sens(W) = 22 5 25 25 5 75 1 Delay increment (%) Limited range of tuning variables 35 5 25 25 5 75 1 Delay increment (%) EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 23 23 Tree Adder: Joint Optimization energy reductio on (%) 1 8 6 4 2 67% 2Vdd + cw 2Vdd gvdd + cw gvdd cw 2 4 6 8 1 d inc (%) energy reduction (% %) @ D min 5 4 3 2 1 42% 1 1. 115 1.15 nom V dd / V dd lure Device fai Choose a more efficient variable Higher V DD yields a lower energy solution EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 24 24
An Update: Slope Aware Delay Model LE overestimates delay Assumes equal slopes Add slope correction K i : gate &V V DD dependentd N g h g h D = gi hi + pi i= 1 Ki i i i 1 i 1 EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 25 25 esult: Adder Example 16 bits, output load: C L = 512 ff (total FO ~ 256) Logical Effort: largely overestimates non critical paths rnal power (µw) Inter 5 45 4 35 3 A Synthesized Adder B Simulation LE with Slope Corr Synthesis Timing Original LE Model Matlab Optimized Adders 25.8 1 1.2 1.4 1.6 1.8 2 Delay (normalized to the synthesized adder) [C.C. Wang, D. Markovic, TCAS-II, Aug 9] EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 26 26
Lessons from Circuit Optimization Sensitivity based optimization framework Equal marginal costs Energy efficient design To reduce energy, it is not always best to reduce V DD Effectiveness of tuning variables Sizing is the most effective for small delay increments Vdd is better for large delay increments Peak performance is VEY power inefficient About 7% energy reduction for 2% delay penalty Limited performance range of tuning variables Additional variables for higher energy efficiency EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 27 27 Choosing Circuit Topology: Optimal egister? 64 bit ALU C in =2 E G C in =1 in b=8 Add C L =32 Cp D S Cp Cp (a) Cycle-latch (CL) Cycle latch (CL) Q Clk Clk Clk 1 Clk 1SS Q S Q D M M Clk Clk 1 Clk 1 Clk Clk 1 Clk (b) Static Master-slave Latch-pair (SMS) Static Master Slave Latch Pair (SMS) Given energy delay tradeoff for adder and register (two register options), what is the best energy delay tradeoff in the ALU? EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 28 28
Balancing Sensitivity Across Circuit Blocks Upsize lower activity blocks (Add) to save energy 2.5 ALU Add Energy (norm.) 2 1.5 1.5 eg CL SMS Add CL SMS 5 1 15 2 25 Delay (FO4) S Add = S eg Energy eg EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 29 29 Micro Architectural Optimization E-D tradeoff par, t-mux Area E-D trade-off Vdd, Vth, W Macro Arch. Micro Arch. Circuit interleaving folding E parallel pipeline time-mux mux # of bits throughput algorithm delay cct topology EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 3 3 E W Vth Vdd
educing the Supply Voltage (while maintaining performance) Concurrency: trading off clock frequency versus area to reduce power f F1 eference design F2 : register, F1,F2: combinational logic (adders, ALUs, etc) [Chandrakasan et al., JSSC 92] C : average switching capacitance EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 31 31 A Parallel Implementation unning slower lowers required V DD (quadratic power reduction) F1 F2 f /2 F1 F2 Almost cancels f /2 EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 32 32
Parallelism Example (9nm CMOS) Power (assuming ov par = 7.5%) t p (norm.) 5 4.5 4 3.5 3 2.5 2 15 1.5 1.5.6.7.8.9 1 V DD (norm.) From J. abaey (UCB) How many levels of parallelism? EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 33 33 The More Parallel the Better? Total Energy eference Parallel From J. abaey (UCB) Supply voltage, V DD Leakage and overhead start to dominate at high levels of parallelism, causing min E to increase Optimum voltage also increases with parallelism EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 34 34
Increasing use of Concurrency Saturates P Fixed Throughput Nominal design (no concurrency) Overhead + leakage Concurrency P min Only option: educe V TH as well! But: Must consider Leakage V DD From J. abaey (UCB) EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 35 35 A Pipelined Implementation Shallower logic reduces required supply voltage F1 F2 f f This example assumes equal V DD for par / pipe designs Assuming ov pipe = 1% EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 36 36
Mapping into the Energy Delay Space Energy delay for ALU realizations with varying parallelism E Op P = 3 P =4 P = 5 P = 2 erence Fixed throughput Minimum EDP Increasing parallelism (P) Delay = 1/Throughput Level of concurrency depends on target performance ule of thumb: If speed exceeds MEP point, add parallelism EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 37 37 E Op / nominal E Op Leakage is Not Necessarily a Bad Thing E Op for three architectures (, par, pipe) of ALU Set V TH, adjust V DD to maintain performance, plot E op 1.8 V th -18mV max.81v dd.6 V th -95mV max.4.57v dd Topology Inv Add Dec 2.2 nominal V -14mV th (E 8 5 2 parallel max Lk /E Sw ) opt.8.5.2.52v pipeline dd Adapt to process and 1-2 1-1 1 1 1 activity variations E Leakage /E Switching Optimal designs have high leakage (E Lk /E Sw.5)! EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 38 38
Parallelism and Pipelining in E D Space f A B A B f f erence (a) 2 A B A B f f f f (c) pipeline pipeline 2 parallel (b) parallel f Ene ergy/op erence It is important to link back to E D tradeoff parallel/pipeline Time/Op EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 39 39 Time Multiplexing A f f f f A 2f 2f f f f f (a) erence for time-mux (b) time-multiplex erence A time mux Energ gy/op erence time mux For throughputs below min EDP: Absorb unused time slack by increasing Clk freq. (and V DD ) Again comes with some area and capacitance overhead Time/Op EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 4 4
Energy Area Tradeoff High throughput: Parallelism = Large Area 4 3 2 1 1 2 1 3 1 4 1 5 64 b ALU Max Eop A = A = 1 5 1 3 A A Ttarget Low throughput: Time Mux = Small Area EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 41 41 Putting it All Together Balance logic depth (L d ) within a block, adjust latency to reach T Clk Apply V DD and W optimization to the underlying pipelines Micro Arch. (block) Circuit (datapath) Latency add Speed Power Area mult Ld W Energy Min delay S W =,S Vdd =2 W W opt @ V DD S W =S Vdd =2 V DD scaling Target delay S W =S Vdd <2 Cycle time (T Clk ) V DD Delay EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 42 42
Motivation for System Level Optimization Optimizations at the architecture or system level can enable more effective power minimization at the circuit level (while maintaining performance), such as Enabling a reduction in supply voltage educing the effective switching capacitance for a given function (physical capacitance, activity) educing the switching rates educing leakage Optimizations at higher abstraction levels tend to have greater potential impact While circuit techniques may yield improvements in the 1 5% range, architecture and algorithm optimizations have reported orders of magnitude power reduction EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 43 43 Some Energy Inspired Design Guidelines For maximum performance Maximize use of concurrency at the cost of area For given performance Optimal amount of concurrency for minimum energy For given energy Least amount of concurrency that meets performance goals For minimum energy Solution with minimum overhead (direct mapping between function and architecture) EEM216A.:. Fall 21 Lecture 8: Energy Delay D. Optimization Markovic / Slide 44 44