Low-Power Design Methodologies and Techniques: An Overview

Department of EE-Systems
University of Southern California
Los Angeles, CA 90064
pedram@ceng.usc.edu

Outline
- Introduction
- Analysis/Estimation Techniques
- Synthesis/Optimization Techniques
- Summary

What Is the Power Problem?
- Power: the biggest challenge facing the industry
  - Power goes up 2X per generation for high-end CPUs
- Analysis tools are excellent at enabling tradeoffs
  - But what if performance cannot be traded off?
  - Need innovative solutions to reduce power

[Figure: power (W) of microprocessors by year of introduction, from the i386 and i486 parts up to the high-end Alpha, HP PA, UltraSparc, MIPS, PowerPC, and Pentium Pro CPUs of the mid-1990s. Source: Microprocessor Report]

Process Technology Trend
Cost-performance designs: notebooks, desktop personal computers, telecom. Data extracted from NTRS 97.

Year                                      1997     1999     2001     2003     2006     2009     2012
Min. feature size (μm)                    0.25     0.18     0.15     0.13     0.10     0.07     0.05
Threshold voltage (V)                     0.5      0.45     0.4      0.35     0.3      0.25     0.2
T ox (nm)                                 5        4        3        3        2        1.5      1
Nominal I on (μA/μm) (nMOS/pMOS)          600/280  600/280  600/280  600/280  600/280  600/280  600/280
Max I off (μA/μm) (minimum-L device)      1        1        3        3        3        10       10
Supply voltage (V)                        2.5      1.8      1.8      1.5      1.2      0.9      0.6
On-chip across-chip clock freq. (MHz)     350      526      727      928      1108     1468     1827
Usable transistors (millions/cm^2)        8        14       16       24       40       64       100
Chip size (cm^2)                          3        3.4      3.85     4.3      5.2      6.2      7.5

C interconnect (aF/μm): 50 36 29 2 20 6 4
Off-chip peripheral bus freq. (MHz): 75/75 00.263 00/362 25/464 25/554 50/734 50/93
Chip pad count: 256-800 300-976 352-93 43-458 524-968 666-2656 846-3578

Design Technology Trend

[Figure: power (W), Vcc (V), and current (A) versus process minimum geometry (nm). Based on information provided by Shekhar Borkar, Intel. Data extracted from EDAR 99]

Example Applications
- Portable electronics (PC, PDA, wireless)
- IC cost (packaging and cooling)
- Reliability (electromigration, latch-up)
- Signal integrity (switching noise, DC voltage drop)
- Thermal design

Where Does the Power Go?
- Varies from one design to the next

[Figure: two pie charts of power breakdown by component: data path, clock, memory, and control/IO]

What Has Worked?
- Voltage and process scaling (3/Generation)
- Design methodologies
  - Power management through HW/SW; trade area for lower power
- Architecture design
- Power-down techniques
  - Clock gating, dynamic power management
- Dynamic voltage scaling based on workload
- Power-conscious RT/logic synthesis
- Better cell library design and resizing methods
  - Capacitance reduction, threshold control, transistor layout

Opportunities for Power Savings

Level        Savings    Techniques
System       > 70%      HW/SW co-design, custom ISA, algorithm design, communication synthesis
Behavioral   40-70%     Scheduling, binding, pipelining, behavioral transformations
RT-Level     25-40%     Clock gating, precomputation, operand isolation, state assignment, retiming
Logic        15-25%     Logic restructuring, technology mapping, rewiring, pin ordering and phase assignment
Physical     10-15%     Fanout optimization, buffering, transistor sizing, placement, partitioning, clock tree design, glitch elimination

Realistic Estimation Expectations

Level          Speed     Error     Techniques
System         Seconds   > 50%     Instruction-level models, IP core models, stochastic models, program simulation
Architectural  Minute    25-50%    Entropic bounds, architectural simulation, I/O and memory accesses
RT-Level       Minutes   15-30%    RT-level macromodels, HDL simulation, quick synthesis
Gate           Hour      10-20%    Probabilistic simulation, gate-level simulation, sampling and compaction, ASIC library models
Transistor     Hours     5-10%     Parasitic extraction, accurate timing analysis, circuit-level simulation

Sources of Power Consumption in CMOS
- Active power
  - Short-circuit
  - Capacitive: interconnect, gate, internal, diffusion
- Standby power
  - Subthreshold
  - P-N junction

Power Dissipation Equations
- CMOS circuits only dissipate power when node voltages are changing

$P = 0.5\, C\, V_{DD}^{2}\, f N + Q_{SC}\, V_{DD}\, f N + I_{leak}\, V_{DD}$

- $C$: switched capacitance
- $Q_{SC}$: short-circuit charge
- $I_{leak}$: leakage current
- $f$: frequency of clocking
- $N$: number of times the gate switches in a clock cycle

Power Estimation Techniques
- Static (non-simulative): useful for synthesis and architectural exploration
  - Probability-based
  - Entropy-based
- Dynamic (simulative): useful for final power evaluation and validation
  - Direct (flat and hierarchical)
  - Sampling-based
  - Compaction-based
- Hybrid (high-level simulation + low-level analytical model evaluation)
  - Power macromodels for datapath, control, memory
  - Instruction-level models for microprocessors, DSPs
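
The three terms of the power dissipation equation above can be evaluated directly. The following is a minimal sketch, not part of the original slides; the function and parameter names are illustrative assumptions, and all quantities are in SI units.

```python
def cmos_power(c_load, v_dd, f_clk, n_sw, q_sc=0.0, i_leak=0.0):
    """Average power (W) of one gate, following
    P = 0.5*C*Vdd^2*f*N + Qsc*Vdd*f*N + Ileak*Vdd.

    c_load : switched capacitance (F)
    v_dd   : supply voltage (V)
    f_clk  : clock frequency (Hz)
    n_sw   : average number of transitions of the gate per clock cycle
    q_sc   : short-circuit charge per transition (C)
    i_leak : leakage current (A)
    """
    capacitive = 0.5 * c_load * v_dd ** 2 * f_clk * n_sw
    short_circuit = q_sc * v_dd * f_clk * n_sw
    leakage = i_leak * v_dd
    return capacitive + short_circuit + leakage


# Hypothetical gate: 10 fF load, 2.5 V supply, 100 MHz clock, 0.5 transitions/cycle.
example_power = cmos_power(10e-15, 2.5, 100e6, 0.5)
```

Because the capacitive term is quadratic in the supply voltage, halving `v_dd` in this model cuts that term by a factor of four, which is why voltage scaling dominates the list of techniques that have worked.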

Probabilistic Power Estimation
- Capture primary input and register output statistics
  - E.g., signal probability, switching activity, spatiotemporal correlation
- Obtain load capacitances for all nodes
  - Estimate at the RT and logic level
  - Extract after layout
- Propagate input statistics throughout the circuit
  - Account for reconvergent fanout
  - Use an appropriate delay model
  - Iterate in the case of finite state machines
- Calculate node power and hence total power

Zero Gate Delays

[Figure: waveforms of signals g1, g2, and f under the zero gate delay model]
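
General Gate Delays
- A gate may switch many times for a vector pair

[Figure: waveforms showing multiple transitions at a gate output under general gate delays]

Static Probabilities
- Define $p_g$ as the static probability of $g = 1$

[Figure: circuit with inputs A, B, C, internal nodes I (AND of A and B) and J (OR of B and C), and output K]

- For the AND gate: $p_I = p_A\, p_B$
- For the OR gate: $p_J = 1 - (1 - p_B)(1 - p_C)$
- The above equations assume that A, B, and C are uncorrelated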
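
The AND and OR rules above are easy to apply gate by gate. The sketch below is an illustration only (the helper names `p_and`, `p_or`, and `p_not` are assumptions, not from the slides) and is valid only under the stated assumption that the gate inputs are uncorrelated.

```python
import math


def p_not(p):
    """Static probability at the output of an inverter."""
    return 1.0 - p


def p_and(*probs):
    """Static probability of an AND gate with uncorrelated inputs."""
    return math.prod(probs)


def p_or(*probs):
    """Static probability of an OR gate with uncorrelated inputs."""
    return 1.0 - math.prod(1.0 - p for p in probs)


# Nodes I = A AND B and J = B OR C with all primary inputs at probability 0.5.
p_a = p_b = p_c = 0.5
p_i = p_and(p_a, p_b)
p_j = p_or(p_b, p_c)
```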

Computing Static Probabilities
- $f = a b + b c$
- Write $f$ as a disjoint sum-of-products expression, where each product term has a null intersection with every other
- Then $f = a b + \bar{a}\, b\, c$ and $p_f = p_a\, p_b + (1 - p_a)\, p_b\, p_c$

Spatial Correlations

[Figure: circuit with inputs A, B, C, internal nodes I and J, and output K]

- $p_K \neq p_I\, p_J$, since I and J are correlated: both depend on B
- Need to compute static probabilities taking into account the spatial correlations
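
Both points above can be checked numerically by brute force. The sketch below is an illustration and not the method of the slides: it enumerates all input combinations (exponential in the number of inputs, so only useful for tiny examples), and it assumes that K is the AND of I and J. The name `exact_probability` is hypothetical.

```python
from itertools import product


def exact_probability(func, input_probs):
    """Exact static probability of func = 1 for independent primary inputs,
    obtained by summing the probability of every satisfying input vector."""
    total = 0.0
    for bits in product((0, 1), repeat=len(input_probs)):
        weight = 1.0
        for bit, p in zip(bits, input_probs):
            weight *= p if bit else 1.0 - p
        if func(*bits):
            total += weight
    return total


p_a, p_b, p_c = 0.3, 0.6, 0.8

# f = ab + bc agrees with the disjoint-cover formula p_a*p_b + (1 - p_a)*p_b*p_c.
p_f_exact = exact_probability(lambda a, b, c: (a and b) or (b and c), (p_a, p_b, p_c))
p_f_disjoint = p_a * p_b + (1.0 - p_a) * p_b * p_c

# K = I AND J with I = AB and J = B + C: multiplying p_I and p_J ignores
# that both nodes depend on B and therefore underestimates p_K here.
p_k_exact = exact_probability(lambda a, b, c: (a and b) and (b or c), (p_a, p_b, p_c))
p_k_naive = (p_a * p_b) * (1.0 - (1.0 - p_b) * (1.0 - p_c))
```

In this example K reduces logically to AB, so the exact value equals `p_a * p_b`, while the naive product is smaller.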

Temporal Correlations

[Figure: bit streams on inputs I and J and on output K]

- $p_K = p_I\, p_J$
- For temporally uncorrelated data, $sw_K = 2\, p_I\, p_J\, (1 - p_I\, p_J)$
- For temporally correlated data, this is not true
  - E.g., every 1 on input I is immediately followed by a 0
- Need to compute switching probabilities taking into account the temporal correlations

Sequential Circuits

[Figure: combinational logic with primary inputs (PI) and present state lines (PS) as inputs, primary outputs (PO) and next state lines (NS) as outputs, and flip-flops (FF) feeding NS back to PS]

- Two issues for sequential circuits:
  - Probability of the machine being in a particular state
  - Correlation in time: given a primary input and present state, the next state is uniquely determined by the next-state combinational logic
- Solutions:
  - Solve the C-K (Chapman-Kolmogorov) equations to calculate the steady-state probabilities
  - Unroll the state machine
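
As a rough illustration of the first solution, the steady-state probabilities of a small FSM can be obtained by iterating the state distribution until it stops changing. This is a sketch under assumptions not stated in the slides: the state transition probabilities have already been derived from the input statistics, the chain is irreducible and aperiodic so the iteration converges, and the names `steady_state` and `switching_activity` are hypothetical.

```python
def steady_state(transition, iterations=1000, tol=1e-12):
    """Steady-state distribution pi of a Markov chain, with pi = pi * P.

    transition[i][j] is the probability of moving from state i to state j.
    Uses plain power iteration starting from the uniform distribution.
    """
    n = len(transition)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def switching_activity(p):
    """Switching activity 2p(1 - p) of a temporally uncorrelated signal
    whose static probability is p."""
    return 2.0 * p * (1.0 - p)


# Hypothetical two-state machine.
pi = steady_state([[0.9, 0.1],
                   [0.5, 0.5]])
```

For large machines the state space makes this explicit form impractical, which is one reason the slides also list unrolling the state machine as an alternative.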

Reducing the Simulation Time

[Figure: flow. The input sequence is reduced either by probabilistic compaction (compacted sequence) or by statistical sampling (sampled sequence). The reduced sequence, the RT-level or gate-level netlist, and the library data feed an event-based or cycle-based simulation, which reports power]

Simple Random Sampling (SRS)
1. From the input vectors, generate one sample of k units
2. Do circuit-level simulation
3. Calculate mean power and confidence interval
4. Converged? If NO, go back to step 1; if YES, report power
- Efficiency depends on the population characteristics and the sampling procedure
- Difficult distributions
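
The SRS loop above can be sketched as follows. This is a minimal illustration, not the authors' tool: `simulate` stands in for the circuit-level simulator and is supplied by the caller, the stopping rule (normal-approximation confidence interval narrower than a relative error bound) is one common choice, and every name and default value here is an assumption.

```python
import random
import statistics


def srs_power_estimate(population, simulate, k=30, rel_err=0.05, z=1.96,
                       min_samples=5, max_samples=1000, seed=0):
    """Simple random sampling: draw samples of k units until the confidence
    interval of the mean power is within rel_err of the mean.

    Returns (mean power, confidence half-width, number of samples drawn).
    """
    rng = random.Random(seed)
    sample_means = []
    mean = half_width = 0.0
    while len(sample_means) < max_samples:
        units = rng.sample(population, k)          # one sample of k units
        sample_means.append(statistics.fmean(simulate(v) for v in units))
        n = len(sample_means)
        if n < 2:
            continue                               # need 2 samples for a variance
        mean = statistics.fmean(sample_means)
        half_width = z * statistics.stdev(sample_means) / n ** 0.5
        if n >= min_samples and half_width <= rel_err * abs(mean):
            break                                  # converged
    return mean, half_width, len(sample_means)


def toy_power(vector):
    """Hypothetical stand-in for a simulator: power proportional to the
    number of 1 bits in the input vector."""
    return 1e-3 * bin(vector).count("1")


mean_power, half_width, samples_used = srs_power_estimate(list(range(4096)), toy_power)
```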

SRS Results

[Figure: SRS results for a biased sequence and a random sequence on circuits C432, C880, C1355, C1908, C2670, C3540, C5315, C6288, and C7552; horizontal axis from 0 to 8000]

Stratified Random Sampling (StRS)
- Estimate the transistor-level power, assuming that the zero-delay power estimate is readily available
1. Input vectors (population)
2. Zero-delay simulation of the entire population
3. Population stratification
4. Random sampling and PowerMill simulation
5. Convergent? If NO, sample again; if YES, report power
- Lower sample variance leads to faster convergence
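
The stratification step above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `cheap` plays the role of the zero-delay power estimate available for every unit, `accurate` plays the role of the expensive transistor-level simulation, the strata are equal-size slices of the population ranked by `cheap`, and a fixed number of units per stratum replaces the convergence loop.

```python
import random
import statistics


def stratified_estimate(population, cheap, accurate,
                        n_strata=4, per_stratum=10, seed=0):
    """Stratified random sampling of mean power.

    The population is ranked by the cheap estimate and cut into n_strata
    slices; each slice is sampled separately and the slice means are
    combined with weights equal to the slice's share of the population.
    Assumes len(population) >= n_strata.
    """
    rng = random.Random(seed)
    ranked = sorted(population, key=cheap)
    size = len(ranked)
    estimate = 0.0
    for s in range(n_strata):
        stratum = ranked[s * size // n_strata:(s + 1) * size // n_strata]
        units = rng.sample(stratum, min(per_stratum, len(stratum)))
        stratum_mean = statistics.fmean(accurate(u) for u in units)
        estimate += (len(stratum) / size) * stratum_mean
    return estimate


def zero_delay_power(vector):
    """Hypothetical cheap predictor: number of 1 bits."""
    return bin(vector).count("1")


def detailed_power(vector):
    """Hypothetical accurate model, strongly correlated with the predictor."""
    return 1.2 * bin(vector).count("1") + 0.3 * (vector & 1)


estimate = stratified_estimate(list(range(1024)), zero_delay_power,
                               detailed_power, per_stratum=20)
```

Because units within a stratum have similar cheap estimates, the accurate power varies less inside each stratum than across the whole population, which is the source of the lower sample variance.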

Sampling on FSMs
- Method 1: functional simulation + sampling on the combinational circuit
  - The input sequence is first simulated on the full machine (logic + FFs); the input sequence plus the recorded state lines is then sampled and applied to the logic alone
- Method 2: do machine warm-up for every sampling unit
  - Warm-up vectors + sampling unit are applied to the full machine (logic + FFs)
  - The warm-up sequence length can be quite large

Probabilistic Compaction (PC)
- Sequence compaction: generate a new, but shorter, input sequence which exhibits nearly the same spatiotemporal correlations as the initial sequence

[Figure: the initial sequence is used to build a probabilistic model; sequence generation from the model produces the new sequence, which is applied to the target circuit]

PC by Example

[Figure: an initial input sequence and the compacted sequence derived from it, each annotated with its average bit activity, together with the stochastic state machine model and its transition probabilities]

Comparison with SRS

[Figure: estimation error (0% to 8%) of SRS versus hierarchical compaction at a fixed compaction ratio for circuits C432, C880, C1355, C1908, C3540, and C6288; L = 4,000]
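
The idea of probabilistic compaction can be sketched with a deliberately simple model. This is an assumption-laden illustration, not the stochastic state machine or dynamic Markov tree construction of the slides: it fits a first-order Markov chain over whole input vectors and generates a shorter sequence by a random walk on that chain. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_markov_model(sequence):
    """Count, for every vector, how often each successor follows it."""
    model = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        model[cur][nxt] += 1
    return model


def compact(sequence, length, seed=0):
    """Generate a sequence of the given length whose vector-to-vector
    transition statistics follow those of the initial sequence."""
    rng = random.Random(seed)
    model = build_markov_model(sequence)
    state = sequence[0]
    out = [state]
    while len(out) < length:
        successors = model.get(state)
        if not successors:
            # Dead end (vector seen only at the very end): restart.
            state = sequence[0]
        else:
            vectors = list(successors)
            weights = [successors[v] for v in vectors]
            state = rng.choices(vectors, weights=weights)[0]
        out.append(state)
    return out


def avg_bit_activity(sequence):
    """Average number of bit toggles between consecutive vectors."""
    toggles = sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))
    return toggles / (len(sequence) - 1)


initial = [0, 1, 2, 3] * 250          # hypothetical 2-bit input sequence
shorter = compact(initial, 101)       # roughly a 10x compaction
```

For this periodic input the model is deterministic, so the compacted sequence reproduces the average bit activity of the initial one almost exactly; real sequences would show a small statistical error instead.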

Compaction for FSMs

[Figure: the input sequence $x_n$ drives the target circuit (logic + FFs), with $z_n = \mathrm{out}(x_n, s_n)$ and $s_{n+1} = \mathrm{next}(x_n, s_n)$; the pairs $(x_n, s_n)$ are modeled by a Markov chain with probabilities $p(x_n s_n)$]

Dynamic Markov Trees

[Figure: the dynamic Markov trees DMT 0 and DMT 1 built for an example sequence S of eight 4-bit vectors, showing the upper subtree and the lower subtrees with their edge counts]

Higher Order DMTs
- A lag-k Markov chain which correctly models the input sequence also models the joint k-step conditional probabilities of the primary inputs and state lines

[Figure: chain of trees. DMT 0 models $p(v_i)$, DMT 1 models $p(v_j v_i)$, and DMT 2 models $p(v_k v_j v_i)$]

High Order DMT Results

[Figure: estimation error (0% to 60%) of order-1 versus order-2 models at a fixed compaction ratio for benchmarks bbara, dk17, mc, planet, shiftreg, s1196, s1423, s5378, s820, and s9234]

Power Characterization

[Figure: two flows. A transistor-level netlist and SPICE parameters go through SPICE simulation to produce gate-level power/delay characterization data; a gate-level netlist goes through event-driven simulation to produce RT-level power/delay characterization data]

Power Macromodeling

[Figure: flow. Inputs: training set, module, macromodel equation form, and model variables. Steps: low-level simulation, regression analysis, and analytic model reduction. Outputs: macro models and regression coefficients]

Dual Bit Type Model
- Consider a data path block whose operands are divided into a sign bit (MSB) region and an LSB region:

$Pwr = C_0 + C_1 S_1 + C_2 S_2 + C_3 S_3 + C_4 S_4$

- $S_1$ ($S_2$): average switching activity of the LSB (MSB) region of operand 1
- $S_3$ ($S_4$): average switching activity of the LSB (MSB) region of operand 2

Bitwise Data Model
- Consider a random logic block:

$Pwr = C_0 + \sum_{i \in \text{inputs}} C_i S_i$

- $S_i$: average switching activity of input signal $i$
- More parameters lead to a higher degree of accuracy, but increase the computational overhead

Cycle-Accurate Macro-Models
- Cycle-accurate macro-models provide:
  - Average power
  - Peak power
  - Power distribution
  - Signal integrity: switching noise, DC voltage drop

Macro-Model Construction
- Starting point: an exact power consumption function

$Pwr = f(V_1, V_2) = \text{order-1 terms} + \text{order-2 terms} + \dots + \text{order-}n\text{ terms}$

- Equation simplification

[Figure: the module model for the exact function, after order reduction, and after input grouping]

Regression Analysis
- Variable reduction: find the most significant variables by using a statistical test

[Figure: the module model after variable reduction]

- Population stratification: makes the macro-model accuracy less sensitive to different input conditions
- Training set design: reduces the macro-model bias
- A general linear regression model:

$Pwr = C_0 + \sum_{i=1}^{k} C_i X_i$

Integrated Macro-Modeling Flow
- Input: Powermill netlist, Powermill configuration, technology file
- Program flow:
  1. Spatial correlation analysis for primary inputs
  2. Training set generation and simulation, using the circuit simulator (Powermill)
  3. Variable reduction and curve fitting
  4. Library generation
- Output: macro-model library
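
The curve-fitting step for the general linear regression model can be sketched with ordinary least squares. This is a minimal illustration, not the flow's actual implementation: it solves the normal equations in pure Python, skips variable reduction and stratification, and uses hypothetical names and training data.

```python
def fit_linear_macromodel(samples, powers):
    """Least-squares fit of Pwr = C0 + sum_i C_i * X_i.

    samples : list of tuples (X_1, ..., X_k), one per training vector
    powers  : list of measured power values from low-level simulation
    Returns [C0, C1, ..., Ck].
    """
    rows = [[1.0] + list(x) for x in samples]      # prepend the constant term
    n = len(rows[0])
    # Normal equations (A^T A) c = A^T y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * y for r, y in zip(rows, powers))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("training set does not determine all coefficients")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(coeffs, x):
    """Evaluate the fitted macro-model for one set of variables."""
    return coeffs[0] + sum(c * xi for c, xi in zip(coeffs[1:], x))


# Hypothetical training set generated from Pwr = 1 + 2*S1 + 3*S2.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.25)]
measured = [1.0 + 2.0 * s1 + 3.0 * s2 for s1, s2 in training]
coeffs = fit_linear_macromodel(training, measured)
```

The same routine fits the dual bit type model if each sample holds the four region activities of a data path block.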

Macro-Modeling Results

[Figure: EAP (%) per circuit for C432, C880, C1355, C1908, C2670, C3540, C5315, C6288, C7552, Mul16, and ADD16. Average EAP = 2.03%]

[Figure: ECP (%) per circuit for the same circuits. Average ECP = 0.04%]

Adaptive Macro-Modeling
- Static macro-model: training set, macro-model generation, macro-model, cycle-based simulation, power estimates
- Adaptive macro-model: training set, macro-model generation, macro-model, cycle-based simulation, power estimates, with low-level simulation and random sampling added to the loop

Standard Cell Characterization
- Input: SPICE netlist, configuration file, command scripts, technology data
- Program flow:
  1. Function validation
  2. Characterization file generation and simulation, using the circuit simulator (HSPICE)
  3. Generalized data base (timing, power, capacitance, wire load, etc.)
  4. Library exportation
- Output: Synopsys library, VHDL simulation library, Verilog simulation library, customized library

Tool Characteristics
- Accurate model
  - Includes both output load (C) and input slope (S)
  - Nonlinear equation: $f = k_0 + k_1 C + k_2 C^2 + k_3 S + k_4 C S$
  - Variable domain stratification: different sets of coefficients for different strata
  - Default sensitization method is pin-to-pin; can be extended to pattern-dependent by customization
- Fast model
  - Small equation form
  - Can be customized to a full table-lookup library
- Easy library maintenance
  - Manage the libraries through a graphical user interface
  - Generalized data base enables data reuse
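
The nonlinear cell equation with per-stratum coefficients can be evaluated as in the sketch below. The data layout (a list of strata, each bounded by a maximum load and a maximum slope), the function name, and all coefficient values are assumptions made for illustration; the slides do not specify how strata are stored.

```python
def cell_model(load, slope, strata):
    """Evaluate f = k0 + k1*C + k2*C^2 + k3*S + k4*C*S using the coefficient
    set of the first stratum that contains the (load, slope) point.

    strata : list of ((c_max, s_max), (k0, k1, k2, k3, k4)), ordered from
             the smallest domain to the largest.
    """
    for (c_max, s_max), (k0, k1, k2, k3, k4) in strata:
        if load <= c_max and slope <= s_max:
            return (k0 + k1 * load + k2 * load ** 2
                    + k3 * slope + k4 * load * slope)
    raise ValueError("operating point outside the characterized domain")


# Hypothetical two-stratum characterization of one pin-to-pin arc.
example_strata = [
    ((0.1, 1.0), (0.1, 2.0, 0.0, 0.5, 0.0)),   # light loads
    ((1.0, 1.0), (0.2, 1.5, 0.1, 0.4, 0.2)),   # heavy loads
]
light = cell_model(0.05, 0.5, example_strata)
heavy = cell_model(0.5, 0.5, example_strata)
```

A table-lookup library trades this small equation form for stored values at each grid point.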

A Power Estimation Methodology

[Figure: flow. An HDL description goes through behavioral/RTL simulation and design planning. The design is partitioned into datapath, controller, memory, multiplexers, glue logic, clock tree, interconnect, hard IP blocks, and soft IP blocks. A database of input vector sequences, cell library, and interconnect estimates supports the power estimation step, which draws on library models, macromodels, analytic models, probabilistic analysis, low-level simulation, and sampling/compaction]

RTL Synthesis Flow

[Figure: flow. RTL code goes through RTL synthesis to an RTL netlist. Event-based or cycle-based simulation provides a data trace for ports and registers to the switching activity calculator. Power macromodeling and the switching activity calculator drive power optimization, which produces the power-optimized netlist]

RTL Optimization for Low Power
- Resource assignment
  - Module and/or register allocation and binding
- Clock gating
- State encoding
  - Two-level and multi-level logic realizations
- Multiple supply voltage scheduling
  - Pipelined and non-pipelined designs

Register Assignment for Low Power
- Avoid value changes at the inputs of functional units during idle cycles

[Figure: a data flow graph that computes Z from the inputs a, b, and c with a divider and an adder, and two alternative assignments of its values to the registers R0 to R3]

An RTL Structure

[Figure: RTL structure with registers R1 to R3 feeding a divider and an adder that produce OUT (Z), with a table of the register contents and functional-unit inputs over cycles 1 to 5. Annotation: an isolation latch can be inserted at the functional-unit input]

A Low-Power RTL Structure

[Figure: RTL structure with registers R0 to R3 feeding the divider and the adder that produce OUT (Z), with a table of the register contents and functional-unit inputs over cycles 1 to 5]

Module Assignment for Low Power
- Do module binding to minimize the input toggling of modules

[Figure: three addition operations O1, O2, and O3 on the operand pairs (a, b), (c, d), and (e, f), with two alternative bindings of the operations to adder modules]

Gated Clocks
- Disable clocking of the input registers if the current instruction does not access them

[Figure: the instruction register feeds the decode logic, which gates the clock of the registers A and B at the ALU inputs]

Logic Synthesis Flow

[Figure: flow. A gate netlist and the cell library go through logic synthesis to a mapped netlist. Event-based or cycle-based simulation provides switching activity information; together with power information from the cell library it drives power optimization, which produces the power-optimal netlist]

Logic Optimization for Low Power
- Combinational logic optimization
  - Technology mapping
  - Gate resizing
  - Pin assignment
- Sequential logic optimization
  - Precomputation
  - State re-encoding
  - Power management for control units

Technology Mapping
- Outputs of the nodes of the combinational circuit are mapped to library gates under the cost function of load capacitance multiplied by switching activity
- To minimize power dissipation, high switching activity points are either hidden within gates or driven by smaller gates
- The minimum power realization under the zero delay model can be obtained using dynamic programming

Circuit and Gate Library

[Figure: example circuit with switching activities 0.109 at its two internal nodes and 0.179 at its output]

Gate     Area    Intrinsic capacitance    Input load capacitance
INV      928     0.1029                   0.0514
NAND2    1392    0.1421                   0.0747
AOI22    2320    0.3410                   0.033
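
Minimum Area Mapping

[Figure: the circuit mapped to one AOI22 gate followed by an inverter; switching activity 0.179 at both gate outputs]

- Area = 2320 + 928 = 3248
- Power = 0.179 (0.3410 + 0.0514) + 0.179 (0.1029) = 0.0887
- Note: capacitances at the circuit inputs are ignored

Minimum Power Mapping

[Figure: the circuit mapped to three NAND2 gates; switching activity 0.109 at the two internal nodes and 0.179 at the output]

- Area = 1392 * 3 = 4176
- Power = 0.109 (0.1421 + 0.0747) * 2 + 0.179 (0.1421) = 0.0726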
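
The cost function used in the two mappings above (switching activity times the capacitance at each gate output, summed over all gates) can be reproduced with a few lines. This is a sketch under assumptions: the library values restore digits that appear to be missing from the extracted table and are chosen so that the minimum-area arithmetic works out, and the netlist encoding (gate type, output activity, list of fanout gate types) is hypothetical.

```python
# gate -> (area, intrinsic capacitance, input load capacitance)
# Assumed reconstruction of the slide's library; the AOI22 input load is
# not needed for this example and is left out.
LIBRARY = {
    "INV":   (928, 0.1029, 0.0514),
    "NAND2": (1392, 0.1421, 0.0747),
    "AOI22": (2320, 0.3410, None),
}


def mapping_cost(netlist):
    """Return (area, power) of a mapped netlist.

    netlist : list of (gate, output switching activity, fanout gate types).
    Power is the sum over gates of activity * (intrinsic capacitance of the
    gate + input load of every gate it drives); capacitances at the circuit
    inputs are ignored, as on the slide.
    """
    area = 0
    power = 0.0
    for gate, activity, fanouts in netlist:
        gate_area, intrinsic, _ = LIBRARY[gate]
        load = intrinsic + sum(LIBRARY[f][2] for f in fanouts)
        area += gate_area
        power += activity * load
    return area, power


# AOI22 driving an inverter.
min_area = mapping_cost([("AOI22", 0.179, ["INV"]),
                         ("INV", 0.179, [])])

# Two NAND2 gates driving a third NAND2.
min_power = mapping_cost([("NAND2", 0.109, ["NAND2"]),
                          ("NAND2", 0.109, ["NAND2"]),
                          ("NAND2", 0.179, [])])
```

The NAND2 mapping costs more area but less power because the high-activity output node sees only the intrinsic capacitance of a small gate.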

Sequential Precomputation
- Selectively precompute the outputs of the logic circuit one clock cycle before they are required, and use the precomputed values to reduce switching activity in the next clock cycle

[Figure: original circuit. Inputs X1, X2, ..., Xn are latched in register R1, pass through combinational block A, and the output f is latched in register R2]

- Predictor functions:
  - $g_1 = 1 \Rightarrow f = 1$
  - $g_2 = 1 \Rightarrow f = 0$
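
Subset Input Disabling Architecture

[Figure: inputs X1, X2, ..., Xn are split between registers R1 and R2 in front of block A. The predictor functions g1 and g2 are computed from the R1 inputs and latched in R3; they control the load enable (LE) of R2]

- $g_1 = 1 \Rightarrow f = 1$
- $g_2 = 1 \Rightarrow f = 0$

A Comparator Example

[Figure: n-bit comparator computing f = (C > D). The most significant bits C<n-1> and D<n-1> go to register R1 and to the predictor logic; the remaining bits C<n-2>, D<n-2>, ..., C<0>, D<0> go to register R2, whose load enable (LE) is controlled by g1 and g2]

- $g_1 = C\langle n-1 \rangle \cdot \overline{D\langle n-1 \rangle}$
- $g_2 = \overline{C\langle n-1 \rangle} \cdot D\langle n-1 \rangle$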
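
The comparator example can be checked with a small behavioral model. This sketch is an illustration, not the circuit itself: it assumes unsigned operands, takes g1 as "MSB of C is 1 and MSB of D is 0" and g2 as the opposite case, and reports whether the low-order register would have been loaded. The function name is hypothetical.

```python
def comparator_with_precomputation(c, d, n):
    """Behavioral model of an n-bit comparator f = (C > D) with
    precomputation on the most significant bits.

    Returns (f, load_enable). load_enable is False when the predictors
    already decide the result, so the register holding the n-1 low-order
    bits keeps its old value and the logic behind it does not switch.
    """
    msb_c = (c >> (n - 1)) & 1
    msb_d = (d >> (n - 1)) & 1
    g1 = msb_c & (1 - msb_d)      # C has the larger MSB: f = 1
    g2 = (1 - msb_c) & msb_d      # D has the larger MSB: f = 0
    if g1:
        return 1, False
    if g2:
        return 0, False
    low_mask = (1 << (n - 1)) - 1  # MSBs are equal: compare the remaining bits
    return int((c & low_mask) > (d & low_mask)), True


# Exhaustive check for 4-bit operands.
N_BITS = 4
results = [(c, d, *comparator_with_precomputation(c, d, N_BITS))
           for c in range(1 << N_BITS) for d in range(1 << N_BITS)]
disabled_fraction = sum(1 for _, _, _, le in results if not le) / len(results)
```

For uniformly distributed operands the two most significant bits differ half of the time, so the low-order register is disabled in half of the cycles.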

Layout Optimization for Low Power and Noise
- Floorplanning and/or placement
- Gate resizing
- Gated clock tree design
- Driver design
- Chip-package interface design
- Power bus planning and sizing

Gated Clock Tree Design

[Figure: gated clock tree showing the source, the sinks, the Steiner points, the clock tree edges, the gates, and the control signals routed from the controller]

- Clock tree: binary tree
- Controller tree: star tree

Clock Tree Layout

[Figure: clock tree with the source, the Steiner nodes, the sinks, and the merging sectors]

- Use the Deferred Merge Embedding algorithm to lay out the clock tree
- The bottom-up merging process is guided by the minimum switched capacitance heuristic
- The top-down phase determines the exact locations of the Steiner nodes in their respective merging sectors

Buffered Clock Tree versus Gated Clock Tree

[Figure: switched capacitance (0 to 800) and area (0 to 70) for benchmarks r1 to r5, comparing Buffered, Gated, and Gate Red. clock trees]

- The average module activity was set to 40%

Conclusions
- Analysis tools:
  - Simulation-based techniques provide the best accuracy
  - Transistor-level power estimation is mature; a satisfactory gate-level and RT-level simulation engine is needed
  - Probability-based techniques are good for relative accuracy analysis and to guide the synthesis engine
- Optimization tools:
  - Gate-level optimization / re-mapping provides 20-40% power reduction
  - Behavioral and RTL power synthesis provides 40-60% power reduction
  - System-level optimization provides 2-5X power reduction
- Power analysis, tradeoffs, and optimizations for design levels higher than RTL are not mature