Design and Optimization of Low-power CMOS Logic Using Logical Effort Model with Slope Correction

UNIVERSITY OF CALIFORNIA
Los Angeles

Design and Optimization of Low-power CMOS Logic Using Logical Effort Model with Slope Correction

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Electrical Engineering

By Chengcheng Wang

2009

Copyright by Chengcheng Wang 2009

The thesis of Chengcheng Wang is approved.

Rajeev Jain
Mani B. Srivastava
Lieven Vandenberghe
Dejan Markovic, Committee Chair

University of California, Los Angeles
2009

TABLE OF CONTENTS

I Introduction
  The Logical Effort Model: Motivations and Solutions
  Optimal Sizing Using the Logical Effort Model
  Tapering and the Logical Effort Model
  Modeling the Input Slope Effect
  Low Power Optimization Beyond Sizing
  Thesis Outline
II The Slope Correction Model
  Motivation: the Input Slope Effect
  The Proposed Slope Correction Model
  Alternative Formulations
III Extracting Model Parameters
  Extracting Logical Effort Parameters
  The Reference Case for Delay Estimation
  Extracting K from Slope Correction Term
  Extracting K under $V_{DD}$ Scaling
IV Sizing Comparisons for Buffer Chains
  Gate Sizing using Slope Correction Model
  Comparisons in the Energy-Delay Space
  Limiting Factors in Sizing Effectiveness
V Incorporating Supply Voltage Optimizations
  Modeling Delay under Supply Voltage Scaling
  Concurrent Optimization of Sizing and Supply Voltage
  Sub-threshold and IC Model
  Modeling Energy under Supply Voltage Scaling
  Optimization with Aggressive Supply Voltage Scaling
VI Optimization for Synthesized Design
  Low Power Optimization for Synthesis Flow Issues Characterizing Low-$V_{DD}$ Performance Variations
  Standard-Cell Designs and the Slope Correction Model
  Comparison of Estimation Accuracy
  Sizing Synthesized Design using the Slope Correction Model
  Comparison of Sizing Optimization Results
  Improving the Optimization Tool
  Incorporating $V_{DD}$ Scaling in Optimization of Synthesized Designs
VII Conclusion
Appendix: System-Level Optimizations for Low Power Designs
  A.1 Simulink Design Environment
  A.2 Automated FPGA Hardware-Acceleration
  A.3 Architectural Optimization
  A.4 Wordlength Optimization
  A.5 Concluding Remarks
References

LIST OF FIGURES

1-1 A minimum-delay buffer chain, with equal fan-out of 4 per stage
1-2 A buffer chain with 1.5% increase in delay
1-3 Tapered inverter chain. Logical effort model assumes equal slope at the input and output of each gate. Actual input slope is sharper
1-4 Input transition time vs. delay for an inverter driving a fixed 4× load
1-5 Simulated and calculated values of the normalized delay vs. input transition time
1-6 Energy-delay trade-off curve
2-1 Input transition time vs. delay for an inverter driving a fixed 4× load, with estimated $K_{slope\text{-}delay\text{-}factor}$ shown
2-2 Delay ($t_p$) and transition time vs. fan-out
2-3 Output delay vs. input fan-out for gate fan-out of 7.5 and 10. Delay is normalized to $\tau_0$ of an inverter
Sample apparatus for extracting logical effort parameters
Energy vs. simulated delay and normalized logical-effort delay; the difference between the two graphs is the slope-correction term
Slope-correction factor K for different gates in a 65-nm technology across supply voltage of 0.5V to 1.2V
Internal energy vs. delay curve for a NAND2 based buffer chain in a 65-nm technology at 1V supply voltage
4-2 Internal energy vs. delay curve for an inverter chain in a 65-nm technology at (a) 1V supply voltage (b) 0.5V supply voltage
Total energy (a) and $V_{DD}$ (b) vs. delay for an inverter chain in 90-nm technology
$V_{DD}$ vs. $I_D$ for α-power and leakage model against simulation
$V_{DD}$ vs. $I_D$ for α-power, leakage, and IC model against simulation
$V_{DD}$ vs. $I_{on}$ and $I_{off}$ for the IC model against simulation
$V_{DD}$ vs. $I_{on}$ and $I_{off}$ for the IC model for LVT and HVT cells
Total, switching, and leakage energy vs. delay for (a) LVT with α = 1% and (b) HVT with α = 10% designs
Energy vs. delay for sizing and $V_{DD}$ optimizations with α = 1% and 10%
$V_{DD}$ vs. delay for the $V_{DD}$ optimization above
Energy-delay sensitivity (left axis) and energy-delay tradeoff (right axis)
Energy-delay sensitivity near MDP (left) and MEP (right)
Energy vs. delay for adder optimization through $V_T$ adjustment
$V_{DD}$ vs. delay for adder optimization with $V_T$ as an optimization variable
Inverter delay ($t_p$) vs. $V_{DD}$, normalized to delay at 1.0V-TT corner
Inverter delay and clock-to-q delay vs. $V_{DD}$, normalized to inverter and clock-to-q delays at 1.0V-TT corner
The clock-to-q transition of a D-flip-flop under 150mV SS corner
The clock-to-q failure of a D-flip-flop under 165mV FS corner
The original netlist in text and the reformatted netlist in Excel
6-6 Flow diagram of the optimization tool
Energy vs. delay for a 16-bit adder driving 512 fF of load, optimized with the slope-correction model using 65-nm standard-cell library
Flow diagram of the improved optimization tool
Gate-information table for each gate
Energy-per-operation after each optimization step
Energy-delay plot of the optimized adder
$V_{DD}$ vs. delay of the adder optimization
A-1 Snapshot of Synplify DSP blocks in Simulink design environment
A-2 A 32 tap FIR design with shared-fifo interface and testbench
A-3 A Synplify DSP design created as a black-box
A-4 A Synplify DSP design created as a black-box
A-5 Testbench for the Synplify DSP design
A-6 Possible transformations and valid architectures given the constraints
A-7 Time-multiplexing for designs with (a) parallel and (b) sequential processing
A-8 Time-multiplexing is attractive for area savings and performance increase due to a small energy overhead
A-9 Energy vs. delay and energy vs. area plot for 1-8x time-multiplexed (and pipelined) logic
A-10 Snapshot of a CORDIC design before wordlength optimization
A-11 Snapshot of the same CORDIC design optimized for MSE of

LIST OF TABLES

6-1 Delay comparisons for two synthesized adders

ACKNOWLEDGEMENTS

First of all, I am wholeheartedly thankful to my advisor, Dejan Markovic. Not only is he patient and helpful in sharing his knowledge and technical skills with me, but his brilliant ideas and passion for this work, along with his sense of humor, have made it truly joyful to work with him. This work is based on his slope-correction idea from my first quarter here as a graduate student. Over the past year and a half, this hobby project has gradually grown into what is presented here today. Not only have I gained so much knowledge in this process, but the experience and the enthusiasm I acquired in this field will continue to benefit me far beyond this project alone. This is an important milestone in my graduate career, and I would not have made it here without him.

I wish to thank Professors Rajeev Jain, Mani Srivastava, and Lieven Vandenberghe for being on my thesis committee. Their helpful and thoughtful comments are greatly appreciated. I also wish to thank Professor Vandenberghe for his patience with me when I was a complete novice in his convex optimization course. What I learned from him has benefited me greatly in this thesis work.

I am also grateful for having the best group members. I will remember the hectic and interesting times of tape-out with Vaibhav Karkare and Chia-Hsiang Yang. I cherish the endless discussions and (friendly) arguments with Rashmi Nanda, Victoria Wang, Viviane Ghaderi and Sarah Gibson about projects, coursework, and research ideas, along with interesting findings in food and fashion. I also thank Tsung-Han Yu and Fenbo Ren for interesting discussions during projects and group meetings.

I sincerely thank my parents for their never-ending care, and for always being my closest teachers and counselors. They shaped me into who I am today, and I am forever indebted to them. I also wish to thank my dear Helen for her daily support, for listening to my babbles, and for encouraging me when I needed it the most; you are the biggest blessing in my life.

Above all, I thank my God and Savior. His goodness, grace, and love have been my greatest strength.

ABSTRACT OF THE THESIS

Design and Optimization of Low-power CMOS Logic Using Logical Effort Model with Slope Correction

by Chengcheng Wang

Master of Science in Electrical Engineering
University of California, Los Angeles, 2009
Professor Dejan Markovic, Chair

The logical effort model is helpful in optimizing gate sizes for minimum delay by allocating equal fan-out to every stage of the datapath. However, such an approach is very energy-inefficient, and significant energy reduction is possible by allowing a small penalty in performance and tapering the gate sizes. Tapering reduces energy by increasing fan-out toward the latter stages, thus decreasing the total gate size, but it also makes the transition at the input of a gate sharper than the transition at its output. This causes the logical effort model to become overly pessimistic because it assumes the transition times to be equal. Such inaccuracy leads to suboptimal gate sizes because delay slack is not fully utilized.

This work introduces a slope-correction model to account for the slope mismatch between the input and output of a gate. This improved model has a simple formulation in which only one additional parameter is needed, thus preserving the simplicity of the original model. It maintains less than 5% error against simulations even under large differences between input and output slope, and it achieves better optimization results than the original model.

This model is then combined with supply voltage scaling to achieve even larger energy savings. A transistor model accurate for all regions of transistor operation is employed to allow aggressive supply voltage reductions down to the point of minimum energy. To allow optimization of complex synthesized logic, a large-scale optimization tool is created that performs efficient global optimization of all logic gates within a design, along with supply and threshold voltage, when possible.

CHAPTER I

Introduction

1.1 The Logical Effort Model: Motivations and Solutions

Most modern digital CMOS designs are timing-driven, so designers always have a strong interest in estimating the delays of logic gates and datapaths, even during the early design phase, and in sizing the logic gates to meet the timing requirements. Although logic delays can be estimated very accurately with simulators such as SPICE and Spectre, that approach is extremely slow, is not feasible for any substantially large design, and provides no design intuition. A designer may enlarge one gate in the hope of improving timing, but the change places a larger load on the previous stage, which may counteract any timing improvement and even result in worse timing than the starting point. The change also increases energy dissipation because of the larger gate size. Such a combination of higher energy and longer delay should certainly be avoided, which calls for optimizing gate sizes. Sizing gates to meet a delay constraint with minimal power dissipation has been a key requirement in most digital designs.

The majority of modern digital designs are synthesized using synthesis and static-timing-analysis tools, which enable users to create multi-million-gate standard-cell designs in a timeframe that would not be possible manually. Such advancement may lead one to think that design intuition about sizing and timing is no longer necessary. However, timing violations are a common issue during synthesis. Although re-synthesis with different constraints may resolve the problem, violations that occur after the place-and-route stage mostly cannot be fixed by re-synthesis because of the high cost of restarting the entire place-and-route flow. Correcting these violations is still a great challenge: it requires local resizing and re-routing based on the availability of routing space and area (for resizing and adding spare cells), and it is often performed manually. Although synthesis is very commonly used, many high-performance designs are still custom-designed, because there is still a performance difference of at least 2× between state-of-the-art synthesized designs and full-custom designs [1]. At the end of the day, intuition about sizing and timing is still essential for any good designer, which brings us back to the old saying, "don't let the computer think for you."

The logical effort model was created out of the need for an intuitive model for gate sizing, and it gives designers an elegant and intuitive way to estimate gate delay. It is formulated as

$t_p = \tau_0 \, (g \cdot h + p)$ (1.1)

where $\tau_0$ is the delay (50%-to-50% transition) of a reference inverter without parasitics, $g$ is the logical effort of the gate, $h$ is the electrical effort $C_{out}/C_{in}$, and $p$ is the parasitic (or self-loading) delay [2]. The parameter $g$ is interesting: it defines the input capacitance required for a gate to have the same drive strength as an inverter. If a complex gate requires 2× the width (and thus ~2× the input capacitance) to achieve the same drive strength as an inverter, its $g$ is 2. The term fan-out is defined as $g \cdot h$, so for the same output load, the fan-out of this complex gate is twice that of an inverter, because its drive strength is only half.

Since $\tau_0$ is the same for every gate of a given technology library, we often remove it from the logical effort formula by defining

$d = t_p / \tau_0$, (1.2)

which allows the logical effort model to be simplified to

$d = g \cdot h + p$. (1.3)

The parameters $g$ and $p$ depend only on the gate structure, not on its sizing, so they are constant for each type of gate. Only $h$ changes with the gate loading, so the delay of a gate is a linear function of its output load, which is the input capacitance (gate size) of the following stage. Such a first-order approximation of delay may seem rudimentary compared with a level-49 SPICE model. Although this model should not be used for final timing sign-off, it produces a surprisingly good fit given its simplicity and is sufficient for providing intuition on optimal sizing, as discussed in the next section.

1.2 Optimal Sizing Using the Logical Effort Model

In 1974, Lin & Linholm proved mathematically that, for a buffer chain of N stages, an equal fan-out per stage provides the shortest delay [3]. The optimal fan-out per stage is given by

$FO = \sqrt[N]{C_{load} / C_{input}}$. (1.4)

The logical effort model extends this formulation to all gates, not just buffer chains. Branching is also included and is modeled as extra output load. The delay of an N-stage datapath and its optimal fan-out are defined as

$D_{path} = \sum_{i=1}^{N} \left( g_i \cdot h_i \cdot b_i + p_i \right)$, (1.5)

$FO = \sqrt[N]{\prod_{i=1}^{N} g_i \cdot h_i \cdot b_i}$. (1.6)

The formulation above defines the optimal fan-out per stage for minimum delay, but it does not define how many stages should be used. According to [1], the optimal fan-out per stage lies between 3.3 and 4, so if the solution from (1.6) is greater than 4, more buffering stages should be added to reduce the fan-out; if the fan-out is less than 3.3, stages should be removed or combined. If several solutions are acceptable, the one with fewer stages should be adopted to reduce energy dissipation.

Let us now examine a design example. Given a buffer chain that needs to drive a load of 1024 with the input load fixed at 1, a chain of 5 stages achieves a fan-out of 4 per stage (since there is no branching in this buffer chain, $b = 1$). The resulting buffer chain is shown in Figure 1-1 along with the sizing of each stage. The total buffer size for this minimum-delay design is 341 (1 + 4 + 16 + 64 + 256).

Figure 1-1: A minimum-delay buffer chain, with equal fan-out of 4 per stage.

The idea of having the minimum possible delay may seem attractive; however, there generally exists a trade-off between performance and power, and in many cases maximum performance is not necessary. It is interesting, therefore, to examine how much energy reduction we can achieve by allowing a small sacrifice in delay. Since switching and leakage energy are directly proportional to gate size, reducing the gate sizes directly reduces energy. We take the same buffer chain from Figure 1-1, allow a 1.5% relaxation in delay, and re-run the sizing optimization; the resulting design is shown in Figure 1-2. The total buffer size is now 151, a reduction of more than 55%.

Figure 1-2: A buffer chain with 1.5% increase in delay.

Such a large energy reduction for a small sacrifice in delay seems remarkable; it arises because the minimum-delay point is very energy-inefficient. If we allow further relaxation in delay, the rate of additional energy reduction diminishes drastically. It is interesting to observe the fan-out per stage of the design in Figure 1-2: instead of an equal fan-out of 4 per stage, the fan-outs of this design are 2.8, 2.86, 3.26, 4.33, and 9.06, respectively. By using low fan-out gates until the end of the datapath, this design effectively reduces the size of the latter stages, which contribute most of the internal energy. This scenario of increasingly larger fan-out is called tapering.

1.3 Tapering and the Logical Effort Model

For the equal fan-out design in Figure 1-1, the logical effort model estimates delay to within 1% of simulation. For the tapered design in Figure 1-2, however, the logical effort model over-estimates delay by more than 10%. The cause of this discrepancy lies in the model's assumptions about input and output slopes. The logical effort model (1.3) suggests a linear relationship between fan-out and delay that is independent of the input transition time, which is certainly not true in general. The linear relationship holds only when the input and output transition times are equal [2]. This assumption holds for the scenario in Figure 1-1 because the fan-out is 4 at every stage, so the rise and fall times of the input and output are approximately equal. The design in Figure 1-2, however, does not satisfy this assumption. The input fan-out is smaller than the output fan-out, resulting in sharper rise and fall times at the input. Yet because the logical effort model is unable to model the input transition time, it still assumes the input transition to be as slow as the output transition (Figure 1-3), which results in overly pessimistic estimates. This scenario is called the input slope effect.
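The two buffer chains discussed above can be checked with a few lines of Python. The sketch below is illustrative only: it assumes inverters with g = 1 and no branching, and the parasitic delay `p = 1.0` is a placeholder value, not a parameter taken from this thesis. The function names are hypothetical.

```python
def stage_sizes(fan_outs, input_size=1.0):
    """Size of every stage in a chain, given the per-stage fan-outs.

    Stage k+1 is the load of stage k, so its size is size_k * fan_out_k.
    The last fan-out drives the external load and adds no further stage.
    """
    sizes = [input_size]
    for fo in fan_outs[:-1]:
        sizes.append(sizes[-1] * fo)
    return sizes


def le_delay(fan_outs, p=1.0):
    """Normalized logical-effort delay d = sum(g*h + p) with g = 1 and b = 1.

    p = 1.0 is an assumed placeholder, not a value from the thesis.
    """
    return sum(fo + p for fo in fan_outs)


equal = [4.0] * 5                          # Figure 1-1: fan-out of 4 per stage
tapered = [2.8, 2.86, 3.26, 4.33, 9.06]    # Figure 1-2: tapered fan-outs

equal_total = sum(stage_sizes(equal))      # 1 + 4 + 16 + 64 + 256
tapered_total = sum(stage_sizes(tapered))  # close to the 151 quoted in the text

print(f"equal fan-out: total size {equal_total:.0f}, LE delay {le_delay(equal):.2f}")
print(f"tapered:       total size {tapered_total:.1f}, LE delay {le_delay(tapered):.2f}")
print(f"size reduction: {100 * (1 - tapered_total / equal_total):.1f}%")
```

Under these assumptions, the logical-effort estimate for the tapered chain is noticeably larger than for the equal fan-out chain, even though the text reports only a 1.5% delay increase; the exact estimate depends on the assumed value of p.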

Figure 1-3: Tapered inverter chain (stages i − 1, i, i + 1). The logical effort (LE) model assumes equal slope at the input and output of each gate. The actual input slope is sharper.

Such pessimistic estimation from the logical effort model is undesirable in sizing optimization. For example, if the timing requirement for the previous design is 10% slower than that of a minimum-delay design, the logical effort model would produce a sizing with only 1.5% higher delay because of its inaccuracy. This modeling error results in an energy-suboptimal design because the delay slack has not been fully utilized.

1.4 Modeling the Input Slope Effect

The input slope effect caused by tapering was known before the logical effort model was even established. To account for the input slope effect among tapered gates, Ma & Franzon [4] formulated the gate delay $t_p$ as

$t_p = t_{step} + B \cdot t_{slope}$, (1.7)

where $t_{step}$ is the gate delay under a step input, $t_{slope}$ is the input transition time (usually from 20% to 80%), and $B$ is the sensitivity of delay to input slope. Although $t_{step}$ was not intended to be modeled by the logical effort model, the parameters $g$, $h$, $p$, and $\tau_0$ can be re-characterized to fit the step-input delay. However, calculating $t_{slope}$ still requires a separate model, and the parameter $B$ must be extracted from simulation. This formulation has been used to optimize sizing for arithmetic blocks in [5], where it reduces the estimation error to within 5% of simulation, while the error of the logical effort model can exceed 20%. However, this accuracy comes at the cost of separate equations and coefficients for rise and fall transitions, along with separate models for delay and transition time. Another assumption that Ma & Franzon make in (1.7) is that delay increases linearly with input transition time, which we need to confirm.

Figure 1-4: Input transition time vs. delay for an inverter driving a fixed 4× load. [Plot: output delay (ps) vs. input transition time (ps); curves: simulation and least-squares fit of Ma & Franzon; the range of common transition times is marked.]

As shown in Figure 1-4, the relationship between input transition time and delay is actually nonlinear, especially for very short transition times. For common transition times, however, the relationship is quite linear and could be approximated by a first-order model. Equation (1.7), however, requires the linear extrapolation to start from $t_{step}$, making the fit less accurate. As shown in Figure 1-4, the slope of the least-squares fit of (1.7) does not match the simulation data well because of the fixed anchor point at transition time 0 ($t_{step}$). The designs in [5] did not have to drive large loads, so the parameter $B$ could be fitted just for short transition times, which provided better accuracy.

In more recent years, many have modified the logical effort model to better capture the input slope effect, along with switching behavior, I/O coupling capacitance, mobility degradation, and velocity saturation effects [6], [7]. However, [6] requires a few SPICE model parameters and 3 additional fitting parameters, along with a nonlinear model for the input slope effect that involves recursive calculations. The model uses a fast-input model for all transition times faster than a constant Fast, and transitions slower than Fast are modeled based on a derivation of the alpha-power model. The resulting fit, however, is quite good even for very slow input transition times, as shown in Figure 1-5. The linear relationship between delay and input slope does not hold for very slow input transitions, but in most digital designs the input fan-out is less than or equal to the output fan-out, so the input slope should be better (or at least not much worse) than the output slope. In practice, we only need to be concerned with σ_HL ranging from 0 to 10 in most digital designs, which results in a curve similar to Figure 1-4 (let Fast = 20 ps).
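The effect of the fixed anchor point in (1.7) can be illustrated with a small least-squares sketch. The delay curve below is synthetic and purely hypothetical (a concave toy function, not simulation data from this thesis), and the free-intercept line is used only as a contrast to the anchored fit; all names are illustrative.

```python
from math import sqrt


def fit_anchored(xs, ys, t_step):
    """Least-squares slope B for y = t_step + B*x, intercept fixed at t_step."""
    num = sum(x * (y - t_step) for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den


def fit_free(xs, ys):
    """Ordinary least-squares line y = a + b*x with a free intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rms(errors):
    """Root-mean-square of a list of residuals."""
    return sqrt(sum(e * e for e in errors) / len(errors))


# Synthetic toy curve (assumption): delay rises steeply at first, then flattens.
T_STEP = 10.0                                  # hypothetical step-input delay, ps
common = [20.0, 30.0, 40.0, 50.0, 60.0]        # hypothetical "common" transition times, ps
delays = [T_STEP + 3.0 * sqrt(t) for t in common]

B = fit_anchored(common, delays, T_STEP)
a, b = fit_free(common, delays)

err_anchored = rms([T_STEP + B * t - d for t, d in zip(common, delays)])
err_free = rms([a + b * t - d for t, d in zip(common, delays)])

print(f"anchored fit: slope {B:.3f}, RMS error {err_anchored:.3f} ps")
print(f"free fit:     slope {b:.3f}, RMS error {err_free:.3f} ps")
```

On this toy curve the anchored line must be steeper than the local trend to pass through the step-input point, so its residual over the common range is larger than that of a line whose intercept is left free.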

Figure 1-5: Simulated and calculated values of the normalized delay vs. input transition time [6]. [The common input transition range of interest is marked on the plot.]

Model [7] adds 3 additional terms to (1.3), and each term is based on complex calculations from the SPICE model. The details of [6] and [7] are not discussed here, because their modeling complexity and the numerous additional parameter extractions they require make them suited mainly to synthesis tools. Although both have an average modeling error of <5%, they are unintuitive for study and design, inelegant for hand calculation, and inefficient for optimization tools. An accurate and intuitive solution for the input slope effect does not yet exist, and one motivation of this thesis is to explore such a solution. An accurate model for tapered gates is also very beneficial in low-power optimization, as we will see in the next section.

1.5 Low Power Optimization Beyond Sizing

From the buffer chain example in Section 1.2, we see that allowing a small delay penalty can significantly reduce energy dissipation compared with the minimum-delay case. However, sizing optimization alone has two limitations: a) re-sizing the datapath reduces only the internal energy, not the energy of the output load, so reducing datapath gate sizes quickly reaches diminishing returns when the total energy is dominated by the output load; b) most designs have fan-out limits due to reliability concerns, which further limit the effectiveness of sizing optimization. To further reduce total energy, other circuit parameters such as supply voltage ($V_{DD}$) and threshold voltage ($V_T$) ought to be optimized concurrently with gate sizes [8]. Since switching energy is proportional to the square of $V_{DD}$, supply voltage reduction is very effective in reducing total energy. However, $V_{DD}$ scaling also slows down the circuit exponentially, so for very low $V_{DD}$ (near $V_T$), the circuit operates so slowly that its leakage energy becomes significant. $V_{DD}$ scaling reaches its limit when the leakage penalty caused by further $V_{DD}$ reduction outweighs the potential energy savings; at that point, the minimum-energy point is reached.

Given sizing, $V_{DD}$, and $V_T$ as optimization parameters, a Pareto-optimal curve can be constructed in the energy-delay space (Figure 1-6) between the point of minimum delay (Dmin) and the point of minimum energy (Emin). Extensive research has been conducted in this area in the past decade [9-13]: given a performance criterion, there is a unique solution in gate size, $V_{DD}$, and $V_T$ that minimizes the energy dissipation. Since $V_T$ is generally fixed for a given technology, sizing and $V_{DD}$ are the main optimization variables. Near Dmin, the optimization is mainly driven by size tapering, which quickly reaches its limit, and the rest of the energy-delay curve is achieved by reducing the $V_{DD}$ of the tapered design.

Figure 1-6: Energy-delay trade-off curve. [Plot annotations: >1000x delay, >10x energy.]

Although traditional designs using the logical effort model focused on optimization near the minimum-delay point, the minimum-energy point is of great interest as well, especially for low-power designs. However, reaching the minimum-energy point usually requires aggressive scaling, causing the circuit to operate in the sub-threshold regime [14]. This scenario again calls for improved modeling, because traditional I-V models and the popular alpha-power model [15] all formulate drive current to be proportional to ($V_{DD} - V_T$); therefore, as $V_{DD}$ approaches $V_T$, the drive current approaches 0 and the delay becomes infinite. Fortunately, much research over the past decade has focused on sub-threshold design. It has produced the EKV/IC model [16], which is accurate for all regions of transistor operation, and has shown that minimum-energy design is indeed feasible and attractive [14,17].

It is established that the minimum-delay point is achieved at a high penalty in energy; conversely, we will see that the minimum-energy point is associated with a substantial performance penalty. However, allowing a small compromise in energy consumption can result in a substantial increase in performance, as we will see in later chapters. With an accurate delay model for sizing tapered gates, combined with an accurate model for $V_{DD}$ scaling, it is now possible to weigh the trade-offs among transistor sizing, $V_{DD}$ scaling, and (when possible) $V_T$ scaling when designing low-power circuits, in order to achieve power-performance optimal designs.

1.6 Thesis Outline

The subsequent chapters first define the proposed slope correction model and its derivation, along with the approximations made in order to arrive at an intuitive yet accurate model (Chapter II). Chapter III details the extraction of the required parameters for the model and the apparatus used for different types of gates. Using the extracted parameters, Chapter IV compares the estimation accuracy of the proposed model with that of the original logical effort model, and demonstrates in the energy-delay space how the two differ when applied to energy optimization. Chapter V introduces $V_{DD}$ as an optimization variable and first uses the alpha-power model to model $V_{DD}$ scaling; it then demonstrates that the EKV/IC model, though more complex, is more suitable for ultra-low-power applications because it models the entire $V_{DD}$ region accurately. Chapter VI extends the model's application to synthesized designs by presenting a Matlab tool that optimizes standard-cell designs based on the presented model. The tool can post-process synthesized netlists or be used concurrently with synthesis tools to locate an optimal $V_{DD}$ given the power/performance requirements. Chapter VII concludes the thesis. The Appendix ascends one level of abstraction and outlines system-level optimizations for low-power designs, including architectural transformation, word-length optimization, and the proposed Simulink-based design/optimization flow.

CHAPTER II

The Slope Correction Model

2.1 Motivation: the Input Slope Effect

As introduced in Chapter I, gate size tapering is very effective in reducing the energy dissipation of an equal fan-out design by allowing a small penalty in delay. Tapering, however, causes the input slope to be sharper than the output slope because of the increasing fan-outs along the datapath (also called slope mismatch). The logical effort model is unable to model this scenario, which makes it inaccurate in estimating the delay of low-power designs. Some proposed solutions were introduced in the previous chapter, though most are overly complex and are targeted at synthesis tools. The solution discussed in Ma & Franzon [4] is simple and intuitive, but it is evident that its modeling accuracy needs improvement. The motivation of the slope correction model is to improve the accuracy of the logical effort model by accounting for the input slope effect while preserving the simplicity and intuition of the original model.

2.2 The Proposed Slope Correction Model

Because of the nonlinear relationship between input slope and delay, the linear model from [4] is unable to provide a well-fitted curve, even though the relationship is quite linear for common input transition times. As a solution, we would like to preserve the linear model for its simplicity, but with a better fit to improve its accuracy. When the input and output slopes are equal, the original logical effort model is able to model the delay accurately, so it serves as a good reference point. It is shown here again for reference:

$t_p = \tau_0 \, (g \cdot h + p)$. (2.1)

However, when the gates are tapered, logical effort assumes a pessimistic input slope and overestimates the delay. Instead of calculating delay based on the step-input delay and the input slope as in [4], which requires a long extrapolation, we propose to start with the estimate from (2.1) and simply subtract delay based on the slope difference between the input and output of the gate. Such a model can be formulated as follows:

$t_p = t_{LE} - K_{slope\text{-}delay\text{-}factor} \cdot (t_{slope,out} - t_{slope,in})$. (2.2)

The parameter $K_{slope\text{-}delay\text{-}factor}$ is the slope-to-delay sensitivity, which defines how much delay is associated with the slope difference. Since $t_{slope,in}$ and $t_{slope,out}$ for tapered gates cannot be as sharp as step inputs, nor can they be very slow because of the maximum fan-out limit in most designs, they generally fall within the common transition times in Figure 1-4. Based on this assumption, $K_{slope\text{-}delay\text{-}factor}$ can be approximated as the slope of the linear region on the delay vs. input transition time plot in Figure 2-1. This proposed model evidently provides a better fit than [4] because its y-intercept is not fixed at $t_{step}$. It therefore avoids the nonlinear region near very short transition times, which rarely occur in digital logic because well-designed gates have a fan-out of at least 1, in addition to parasitic loading, which is sufficient load to provide an input/output slope of at least 30 (Figure 2-2).
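As a minimal sketch of (2.2), the correction can be written as a one-line function, with K estimated as the slope between two points in the linear region of a delay vs. input transition time curve. All numeric values below are hypothetical placeholders chosen for illustration, not parameters extracted in this thesis, and the function names are assumptions.

```python
def estimate_k(point_a, point_b):
    """Slope-to-delay sensitivity from two (t_slope_in, delay) points
    taken in the linear region of a delay vs. input-slope curve."""
    (t_a, d_a), (t_b, d_b) = point_a, point_b
    return (d_b - d_a) / (t_b - t_a)


def slope_corrected_delay(t_le, k_slope, t_slope_out, t_slope_in):
    """Delay per (2.2): start from the logical-effort estimate t_le and
    subtract K times the output/input transition-time difference."""
    return t_le - k_slope * (t_slope_out - t_slope_in)


# Hypothetical numbers, in picoseconds, for illustration only.
k = estimate_k((20.0, 22.0), (60.0, 30.0))

# Equal input and output slopes: the correction vanishes and the
# estimate falls back to the logical-effort value.
matched = slope_corrected_delay(40.0, k, 60.0, 60.0)

# Tapered gate: sharper input than output, so the estimate is reduced.
tapered = slope_corrected_delay(40.0, k, 60.0, 25.0)

print(f"K = {k:.2f}, matched = {matched:.1f} ps, tapered = {tapered:.1f} ps")
```

The sketch takes the transition times as given; obtaining them at every node is the step that the fan-out form of the model is meant to avoid.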

Figure 2-1: Input transition time vs. delay for an inverter driving a fixed 4× load, with the estimated $K_{slope\text{-}delay\text{-}factor}$ shown. [Plot: output delay (ps) vs. input transition time (ps); curves: simulation and K slope-delay sensitivity.]

Figure 2-2: Delay ($t_p$) and transition time vs. fan-out. [Plot: $t_p$ and 10%-90% input slope vs. FO.]

However, (2.2) still requires calculating the input and output transition times ($t_{slope,in}$, $t_{slope,out}$) at every node of the datapath, which could be tedious for the user (this is one of the drawbacks of the Ma & Franzon model). To simplify the formulation, we see that transition times can be approximated by an RC model [5], which can be modeled as a linear function of fan-out (Figure 2-2). Such modeling is an approximation, because (similar to delay formulations) the transition time of a gate also depends on the transition times of its previous stages. However, modeling such a scenario would require the $t_{slope}$ model to be a recursive function, which is unattractive for hand-analysis. Now that the slope-correction term can be formulated as a function of fan-out rather than transition time, we can calculate the delay (of gate i) as

$$t_{p,i} = t_{LE,i} - \tau_0\,\frac{g_i h_i - g_{i-1} h_{i-1}}{K_{FO\text{-}delay\text{-}factor,i}}. \qquad (2.3)$$

Since g, h, and $\tau_0$ are needed for the logical effort model, the only additional step is to extract the gate-specific parameter $K_{FO\text{-}delay\text{-}factor}$ (K in short). Once K is extracted, the model can achieve better accuracy than the linear model from Ma & Franzon, as shown in Figure 2-3. Since slope mismatch occurs in tapered gates whose output loading is significantly larger than the input load, inverters driving output fan-outs of 7.5 and 10 are shown (well-tapered gates will not have input fan-out greater than output fan-out). The output delay is a slightly nonlinear function of input slope (or input fan-out), and the proposed slope correction model makes a more accurate linear approximation. The slope correction model is most accurate when input and output fan-outs are equal, because that is the case with no slope mismatch, so it produces the same estimation as the original logic

effort model. The original logical effort model is clearly inaccurate in the case of slope mismatch, and its estimation error due to the input slope effect alone can reach more than 20% in heavily tapered gates.

Figure 2-3: Output delay vs. input fan-out for gate fan-out of 7.5 and 10. Delay is normalized to $\tau_0$ of an inverter. (Curves: LE model (1), simulation, proposed slope correction, Ma & Franzon (2).)

2.3 Alternative Formulations

Alternatively, we can define a parameter s to be 1/K. Based on (2.3), we can then formulate delay (of gate i) as a weighted sum of $g\,h$ from the current and previous stage,

$$t_{p,i} = \tau_0\,\big[(1 - s_i)\,g_i h_i + s_i\,g_{i-1} h_{i-1} + p_i\big]. \qquad (2.4)$$
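The equivalence between the K-based form (2.3) and the weighted-sum form (2.4) can be checked numerically. The Python sketch below assumes (2.3) reads t = t_LE − τ0·(g_i·h_i − g_(i−1)·h_(i−1))/K with t_LE = τ0·(g·h + p), and that s = 1/K; the function names and numbers are illustrative assumptions.

```python
def delay_k_form(tau0, g, h, p, g_prev, h_prev, k):
    """Gate delay per (2.3): logical-effort delay minus a fan-out-based
    slope correction scaled by 1/K."""
    t_le = tau0 * (g * h + p)
    return t_le - tau0 * (g * h - g_prev * h_prev) / k


def delay_s_form(tau0, g, h, p, g_prev, h_prev, s):
    """Gate delay per (2.4): weighted sum of the current and previous
    stage efforts, with weight s = 1/K on the previous stage."""
    return tau0 * ((1.0 - s) * g * h + s * g_prev * h_prev + p)


# Hypothetical inverter (g = 1, p = 1) with output fan-out 8, input fan-out 2.
TAU0, K = 5.0, 4.0
t_k = delay_k_form(TAU0, 1.0, 8.0, 1.0, 1.0, 2.0, K)
t_s = delay_s_form(TAU0, 1.0, 8.0, 1.0, 1.0, 2.0, 1.0 / K)
print(t_k, t_s)
```

Both forms return the same value, so the choice between them is a matter of which parameter (K or s) is more convenient to reason about.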

Compared to the logical effort model, the only additional parameter is $s_i$, so (2.4) is still simple enough for hand-analysis. Intuitively, complex gates tend to have weaker drive-strength, so even with a fast transition at the input, delay is still dominated by their own drive-strength. As a result, complex gates that are drive-strength-limited should have smaller s, as their delay depends more on their own sizing. On the other hand, simple gates such as inverters are stronger drivers, so their delay will be more dependent on the input transition time. These gates are input-slew-limited, and should have larger s. This hypothesis will be verified after the extraction of K.

The logical effort model defines fan-out to be $g\,h$; however, the parasitic load p also contributes to delay because it is additional capacitance that the driver needs to charge and discharge. As a result, some have suggested that fan-out should be defined as $g\,h + p$; then the optimal fan-out per stage for minimum delay will be

$$FO = \sqrt[N]{\prod_{i=1}^{N} \left(g_i h_i b_i + p_i\right)}. \qquad (2.5)$$

Since g and p of each gate are known, the load h for gate i can be determined as

$$g_i h_i = FO - p_i. \qquad (2.6)$$

For those who prefer the formulation above, (2.3) can alternatively be modeled as

$$t_{p,i} = t_{LE,i} - \tau_0\,\frac{(g_i h_i + p_i) - (g_{i-1} h_{i-1} + p_{i-1})}{K_{FO\text{-}delay\text{-}factor,i}}. \qquad (2.7)$$

Equations (2.4) and (2.7) are each intuitive in their own aspects, and can be used based on user preference. However, the rest of this thesis will follow the original definition of fan-out as described in [2], and will use (2.3) as the slope correction model.

CHAPTER III

Extracting Model Parameters

3.1 Extracting Logical Effort Parameters

Most parameters of the slope correction model are the same as those for the logical effort model, and to properly extract the slope-correction factor K, the logical effort parameters ($\tau_0$, g and p) need to be extracted first. A simple way to extract these parameters for any gate is by simulating a chain of gates.

Figure 3-1: Sample apparatus for extracting logical effort parameters, panels a) and b).

Figure 3-1 shows two sample apparatuses for extracting the logical effort parameters. Apparatus a) is the one shown in [2], where a chain of identically sized gates is used, and each gate drives a copy of itself plus another gate of size (M-1). The (M-1) sized gate is used to drive another fan-out of M to prevent the Miller effect. To create a gate of size M, do not simply make the gate M times wider; instead, a multiplier of M should be used. This scales the gate and parasitic capacitances more accurately, and is also a more realistic scenario, since most standard cells are limited in width (usually 1-2 μm due to fixed spacing between the $V_{DD}$ and ground rails), so a wide gate is created by using multipliers. The gate delay is gathered at the 4th gate in the chain (shown in red). The reason for this set-up is that the first 3 stages are used to shape the proper transition time for a fan-out of M, so the 4th gate will have equal fan-out of M at the input and output, and will not be affected by the input-slope effect seen by the first gate [2]. The gate delay $t_p$ should be the average of the rising and falling delays.

Alternatively, apparatus b) can be used. Though this is not generally used to extract logical-effort parameters, this apparatus will be used to extract K. The large fan-out of $M^5$ at the output allows sufficient room for tapering to provide enough slope mismatch data for extracting K.

To properly extract the logical effort parameters, start with an inverter, then sweep M from 2 to 10 and extract its gate delay as a function of M. The extracted gate delays should be fitted to the function

$$t_p = \tau_0\,(M + p), \qquad (3.1)$$

because g for an inverter is 1, and fan-out h is equal to M. Parameter $\tau_0$ is the slope of the line (delay increment per additional fan-out), and p is the y-intercept of the line normalized to $\tau_0$ (self-loading is the gate delay when fan-out is 0). For complex gates, each input should be characterized separately while tying the other inputs to supply or ground to create the worst-case scenario (e.g., in an AOI gate, the NAND and NOR functions should be characterized separately). The extracted gate delay should be fitted to the function

$$t_p = \tau_0\,(g\,M + p). \qquad (3.2)$$

However, $\tau_0$ should be the same as that of the reference inverter from (3.1), so changes in the slope of the line should be fitted by g. Parameter p will also be different because complex gates generally have more self-loading. More details about extracting the logical effort parameters can be found in Chapter 5 of [2].

3.2 The Reference Case for Delay Estimation

To accurately extract the error caused by slope mismatch, we first need to calculate the estimation error with equal fan-out per stage to serve as reference. Similar to Figure 3-1b, a chain of 5 stages is used for our extraction, and the output load is held fixed. This time, however, we are interested in minimizing the delay from the input of the first gate to the output of the last gate. We know from [2, 3] that equal fan-out per stage leads to minimum delay, which can be calculated using the fmincon function in Matlab, or just

calculated by hand. In this case, the delay is the logical-effort path delay $D_{LE}$ (normalized to $\tau_0$) modeled by (3.3),

$$D_{LE} = \sum_{i=1}^{N} \left(g_i h_i + p_i\right). \qquad (3.3)$$

The minimum delay in this case has equal fan-out of 4 per stage. Since the input and output fan-outs are equal, the slope-correction term has no effect, and the logical effort model is quite accurate. The error against simulation results is typically less than 5% for common gates. We define this error to be the reference error $D_{err,ref}$, because it is not caused by slope mismatch. Once size-tapering is used for energy reduction, slope mismatch will cause the logical effort error to increase.

3.3 Extracting K from Slope Correction Term

Given the minimum-delay design, we can introduce tapering to reduce the gate sizes by allowing longer delays. To allow sufficient room for tapering, the delay constraint is relaxed by up to 50% to observe the energy reduction and slope mismatch at different delay points. To minimize energy, we used the fmincon function in Matlab to minimize gate sizes subject to the delay constraint modeled by (3.3). The optimization produces tapered gate sizing, causing the model from (3.3) to over-estimate delay compared to simulation. Since the reference error $D_{err,ref}$ was calculated in the previous section, we can now isolate the error caused by tapering, which is used to extract K in the slope correction model. Adding the slope-correction term, we can model the delay $D_{LE,SC}$ of a datapath as

$$D_{LE,SC} = \sum_{i=1}^{N} \left(g_i h_i + p_i - \frac{g_i h_i - g_{i-1} h_{i-1}}{K_i}\right), \qquad (3.4)$$

where N is the number of logic stages in the path, and index 0 represents the input driver. In this gate characterization, the same gate is used in every stage, and thus the same K; therefore the intermediate terms $(g_i h_i)/K_i$ cancel out with the $(g_{i-1} h_{i-1})/K_i$ of the next stage, and the delay model can be simplified to

$$D_{LE,SC} = D_{LE} - \frac{g_N h_N - g_{in\text{-}driver}\,h_{in\text{-}driver}}{K}, \qquad (3.5)$$

where the first term is the original logical effort model estimation from (3.3), and the second term is the slope correction.

Figure 3-2: Energy vs. simulated delay and normalized logical-effort delay; the difference between the two graphs is the slope-correction term. (Axes: internal energy (normalized) vs. delay increment (%); curves: simulation, original LE, with $D_{SC}$ marked.)
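The cancellation that takes (3.4) to (3.5) can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: delays are normalized to τ0, every stage uses the same hypothetical K = 4, and the fan-outs, parasitics, and input-driver effort are made-up values.

```python
def path_delay_le(g, h, p):
    """Logical-effort path delay (3.3), normalized to tau_0."""
    return sum(gi * hi + pi for gi, hi, pi in zip(g, h, p))


def path_delay_sc(g, h, p, k, gh_driver):
    """Slope-corrected path delay (3.4). gh_driver is g*h of the input
    driver (index 0 in the text)."""
    total = 0.0
    gh_prev = gh_driver
    for gi, hi, pi, ki in zip(g, h, p, k):
        gh = gi * hi
        total += gh + pi - (gh - gh_prev) / ki
        gh_prev = gh
    return total


# Hypothetical 5-stage inverter chain with tapered fan-outs (product = 1024).
g = [1.0] * 5
p = [1.0] * 5
h = [2.0, 4.0, 4.0, 4.0, 8.0]
K = 4.0
GH_DRIVER = 4.0

d_le = path_delay_le(g, h, p)
d_sc_full = path_delay_sc(g, h, p, [K] * 5, GH_DRIVER)       # (3.4)
d_sc_short = d_le - (g[-1] * h[-1] - GH_DRIVER) / K          # (3.5)
print(d_le, d_sc_full, d_sc_short)
```

The two slope-corrected values agree because the per-stage terms telescope when K is the same in every stage; with stage-dependent K the full sum in (3.4) would be needed.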

Comparing against simulation results $D_{sim}$, we can extract (3.5) by setting $D_{LE,SC} = D_{sim} - D_{err,ref}$. The slope-correction term in (3.5) can be extracted as $D_{SC} = D_{sim} - D_{err,ref} - D_{LE}$. It is shown graphically in Figure 3-2, where $D_{sim}$ is plotted against $D_{LE} + D_{err,ref}$, and $D_{SC}$ is the difference between the two plots. From the slope-correction term, we can extract K of the gate, because $g_N h_N$ and $g_{in\text{-}driver}\,h_{in\text{-}driver}$ are both known. For each gate, the extracted K varies slightly with fan-out due to the non-linearity of delay (Fig. 2-1), so K is determined as the least-squares fit of values extracted at different fan-outs. Even though this least-squares fit provides a more accurate value for K, it is more time consuming. To save time, we can instead perform simulation for only one typical scenario (e.g., delay slack of 10%), and the extracted K is generally within 5% of the least-squares-fitted K.

3.4 Extracting K under V_DD Scaling

As discussed in Chapter I, $V_{DD}$ scaling is very effective in reducing the energy dissipation for low-power applications; therefore, it is interesting to extract K under different supply voltages and observe any changes. Fortunately, supply voltage directly affects the drive-current of all gates, so $V_{DD}$ scaling only scales $\tau_0$, and the remaining logical effort parameters still provide an accurate linear fit. Given this, we can simply gather the simulation data at different supply voltages, divide the delay by $\tau_0$, and use (3.5) to re-extract K using the same method. The extracted K for a variety of gates is shown in Figure 3-3, under supply voltages of 0.5 V to 1.2 V. The inputs to NAND and NOR gates all provide similar K values, and are not plotted individually, but the inputs for the two branches in the AOI gate are shown separately.
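Returning to the extraction step of Section 3.3, one possible way to fit K from several tapered design points is sketched below in Python. It fits 1/K by least squares through the origin, using the relation implied by (3.5); this is an illustrative choice, not necessarily the exact fitting procedure used for the thesis results, and the sample data are synthetic (generated with K = 4).

```python
def extract_k(samples):
    """Fit K from (3.5) by least squares through the origin.

    Each sample is (gh_last, gh_driver, d_le, d_sim, d_err_ref), with all
    delays normalized to tau_0. From (3.5):
        d_le - (d_sim - d_err_ref) = (gh_last - gh_driver) / K
    """
    sxx = 0.0
    sxy = 0.0
    for gh_last, gh_driver, d_le, d_sim, d_err_ref in samples:
        x = gh_last - gh_driver            # slope mismatch in fan-out terms
        y = d_le - (d_sim - d_err_ref)     # delay over-estimated by LE model
        sxx += x * x
        sxy += x * y
    # The fitted slope is sxy / sxx = 1/K, so K is its reciprocal.
    return sxx / sxy


# Synthetic design points (hypothetical numbers, consistent with K = 4).
samples = [
    (6.0, 4.0, 27.0, 27.0, 0.5),
    (8.0, 4.0, 28.0, 27.5, 0.5),
    (12.0, 4.0, 30.0, 28.5, 0.5),
]
k_fit = extract_k(samples)
print(k_fit)
```

Using a single sample in the list corresponds to the faster single-scenario extraction mentioned in the text.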

Figure 3-3: Slope-correction factor K for different gates in a 65-nm technology across supply voltages of 0.5 V to 1.2 V. (Axes: slope correction factor K vs. supply voltage $V_{DD}$ (V); curves: AOI12 NAND, AOI12 NOR, NAND3, NAND2, NOR2, inverter.)

It is interesting to see that K reduces as supply voltage is decreased, suggesting a stronger input-slope effect. Intuitively, this is because drive-current is still exponentially proportional to gate-to-source voltage when the transistor is in sub-threshold. As $V_{DD}$ scales down, the transition point ($V_M$) becomes very close to (and eventually crosses) $V_T$. As a result of $V_{DD}$ scaling, the transistor that is turning on remains in sub-threshold for the majority (if not all) of its transition period, and because its drive-current is exponentially proportional to its gate voltage, a slow transition at its gate causes a larger penalty on delay. Therefore, gates operating at lower $V_{DD}$ are more sensitive to the input-slope effect.

At the end of Chapter II, we hypothesized that more complex gates will have larger values of K, because their limited drive-strength causes a slow output transition long after the input has settled, making sizing a more dominant factor in their delays than input transition time. This hypothesis is verified in Figure 3-3, where we see that complex gates such as AOI and NAND3 have larger values of K, while the inverter has the smallest value of K.

CHAPTER IV

Sizing Comparisons for Buffer Chains

4.1 Gate Sizing using Slope Correction Model

With the logical effort parameters and parameter K extracted (Chapter III), we can again use fmincon in Matlab to minimize gate sizes, but instead use the delay model (3.4) to serve as the delay constraint. However, it is interesting to note that the minimum delay possible with (3.4) is no longer produced by equal fan-out per stage. To validate this observation, let us examine the gradient differences between the two models. Using the logical effort model to estimate buffer chain delays, an N-stage buffer chain from (3.3) can be described as a function of gate sizes

$$D_{LE} = \sum_{i=1}^{N} \left(g_i \frac{C_{i+1}}{C_i} + p_i\right), \qquad (4.1)$$

where $C_1 = 1$ and $C_{N+1} = C_{Load}$. Differentiating (4.1) and setting the derivative equal to 0, we obtain

$$\frac{dD_{LE}}{dC_i} = 0 = g_{i-1}\frac{1}{C_{i-1}} - g_i \frac{C_{i+1}}{C_i^2}, \qquad (4.2)$$

and after multiplying both sides by $C_i$, we obtain

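$$g_{i-1}\frac{C_i}{C_{i-1}} = g_i \frac{C_{i+1}}{C_i}, \qquad (4.3)$$

which means minimum delay is achieved by equal fan-out per stage, as expected from [2, 3]. However, equation (3.4) poses a slightly different scenario, because now there is a slope-correction term that is also a function of $C_i$. For easier differentiation, let us first formulate (3.4) as

$$D_{LE,SC} = \sum_{i=1}^{N} \left[\left(1 - \frac{1}{K_i}\right) g_i \frac{C_{i+1}}{C_i} + p_i + \frac{1}{K_i}\, g_{i-1} \frac{C_i}{C_{i-1}}\right]. \qquad (4.4)$$

Differentiating (4.4) and setting it equal to 0, we obtain

$$0 = \frac{dD_{LE,SC}}{dC_i} = \left(1 - \frac{1}{K_{i-1}}\right)\frac{g_{i-1}}{C_{i-1}} - \left(1 - \frac{1}{K_i}\right) g_i \frac{C_{i+1}}{C_i^2} + \frac{1}{K_i}\frac{g_{i-1}}{C_{i-1}} - \frac{1}{K_{i+1}}\, g_i \frac{C_{i+1}}{C_i^2}. \qquad (4.5)$$

However, for the last stage driving the large capacitive load, there is no (i+1)th buffer stage; therefore the derivative of (4.4) for the last buffer stage (i = N) becomes

$$0 = \frac{dD_{LE,SC}}{dC_N} = \left(1 - \frac{1}{K_{N-1}}\right)\frac{g_{N-1}}{C_{N-1}} - \left(1 - \frac{1}{K_N}\right) g_N \frac{C_{Load}}{C_N^2} + \frac{1}{K_N}\frac{g_{N-1}}{C_{N-1}}. \qquad (4.6)$$

After multiplying both sides of (4.5) and (4.6) by $C_i$, we obtain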

$$\left(1 - \frac{1}{K_{i-1}} + \frac{1}{K_i}\right) g_{i-1}\frac{C_i}{C_{i-1}} = \begin{cases} \left(1 - \dfrac{1}{K_i} + \dfrac{1}{K_{i+1}}\right) g_i \dfrac{C_{i+1}}{C_i}, & i < N \\[2ex] \left(1 - \dfrac{1}{K_N}\right) g_N \dfrac{C_{Load}}{C_N}, & \text{when } i = N. \end{cases} \qquad (4.7)$$

Since every gate in a buffer chain is of the same gate type, parameter K is the same for every stage. This implies that every stage in the buffer chain will have the same fan-out, with the exception of the last stage: the fan-out of the last stage will be $1/(1 - 1/K_N)$ times larger than that of the previous stages. Using this formulation, the optimal fan-out per stage for the first N-1 stages is

$$FO = \sqrt[N]{\left(1 - \frac{1}{K}\right)\frac{C_{load}}{C_{input}}}, \qquad (4.8)$$

and the fan-out of the last stage is $FO/(1 - 1/K)$. Since the slope-correction model subtracts delay for tapered gates, this derivation suggests that the tapered scenario actually leads to a shorter minimum delay than that possible with the equal fan-out case.

4.2 Comparisons in the Energy-Delay Space

To characterize the differences between the original logical effort model and the slope correction model for low power designs, it is interesting to observe the estimation differences between the two models in the energy-delay space. Function fmincon is used to minimize the gate sizes given either (3.3) or (3.4) as the delay constraint, and the estimation results are compared against simulation.
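The closed-form result of Section 4.1 can be sanity-checked numerically. The Python sketch below assumes a chain of identical gates with a single K, reads (4.8) as FO = ((1 − 1/K)·C_load/C_input)^(1/N) with a last-stage fan-out of FO/(1 − 1/K), and evaluates the delay with the simplified model (3.5). The values K = 4, p = 1, the load of 1024, and the input-driver effort of 4 are hypothetical.

```python
def optimal_fanouts(n_stages, k, c_load, c_in=1.0):
    """Fan-outs minimizing the slope-corrected delay of a chain of identical
    gates: (4.8) for the first N-1 stages, FO / (1 - 1/K) for the last."""
    fo = ((1.0 - 1.0 / k) * c_load / c_in) ** (1.0 / n_stages)
    return [fo] * (n_stages - 1) + [fo / (1.0 - 1.0 / k)]


def chain_delay_sc(fanouts, k, gh_driver, g=1.0, p=1.0):
    """Slope-corrected path delay (normalized to tau_0) for identical gates,
    using the simplified form (3.5)."""
    d_le = sum(g * h + p for h in fanouts)
    return d_le - (g * fanouts[-1] - gh_driver) / k


# Hypothetical example: 5 inverters (g = 1, p = 1), K = 4, total load 1024.
N, K, C_LOAD = 5, 4.0, 1024.0
equal = [C_LOAD ** (1.0 / N)] * N          # equal fan-out of 4 per stage
tapered = optimal_fanouts(N, K, C_LOAD)

d_equal = chain_delay_sc(equal, K, gh_driver=4.0)
d_tapered = chain_delay_sc(tapered, K, gh_driver=4.0)
print(tapered, d_equal, d_tapered)
```

Under these assumptions the tapered chain drives the same total load with a slightly smaller modeled delay than the equal fan-out chain, in line with the observation that the minimum delay is no longer produced by equal fan-out.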

In the previous chapter, Figure 3-2 plotted the differences between the logical effort model and simulation for a NAND2 gate in 65-nm technology. The same gate is plotted in Figure 4-1 with both the original and the slope-corrected model shown. The reference error ($D_{err,ref}$) is subtracted from both models to isolate the error caused by tapering. As a result, all the plots start at internal energy of 1 and delay increment of 0, which is normalized to the equal fan-out case that serves as the reference. We see the slope correction model provides a much more accurate delay estimation. Even for a delay increment of 40%, where fan-out can reach 16 or more, the slope correction model is only slightly more conservative.

Figure 4-1: Internal energy vs. delay curve for a NAND2 based buffer chain in a 65-nm technology at 1V supply voltage. (Axes: internal energy (normalized) vs. delay increment (%); curves: simulation, original LE, LE with slope correction; the inset marks the minimum-delay region with labeled points A through E.)

From the inset in Figure 4-1, it is noticeable that tapering does lead to slightly lower delay compared to the equal fan-out case. During initial downsizing (points A → B → C), delay actually decreases by nearly 1% (up to 3% under 0.5V supply) and then increases with further downsizing. By taking advantage of tapering, we can reduce both energy and delay compared to the equal fan-out reference case. This advantage allows the tapered design to achieve the same delay as the reference case (point E) with 40% reduction in internal energy (varies from 25-60% depending on the type of logic gate and supply). The original logical effort model is inaccurate under tapering, leading to sub-optimal energy-delay. For example, at 10% delay increment, the slope-correction model requires an internal energy of 0.28, while the original model requires 0.4. The minimum-delay point (point C) obtained by tapering cannot be predicted by the logical effort model (point C′), but the slope-correction model is able to locate the minimum delay (C) and construct an accurate delay estimation from that point on (C → D → E etc.). The slope-correction error is within 5% across all supply voltages when the fan-out is less than 32, which is the case in most applications. However, the error may reach 15% for fan-outs greater than 80, because it is difficult to model such large fan-out with this linear model.

To demonstrate the scenario under different supply voltages, Figure 4-2 shows an inverter chain in 65-nm technology at $V_{DD}$ of 1.0V and 0.5V. We see that the energy-delay characteristics of the inverter at 1.0V are very similar to those of the NAND2 case; in fact, most logic gates operating at 1.0V have similar energy-delay curves.

Figure 4-2: Internal energy vs. delay curve for an inverter chain in a 65-nm technology at (a) 1V supply voltage (b) 0.5V supply voltage. (Axes: internal energy (normalized) vs. delay increment (%); curves: simulation, original LE, LE with slope correction.)

Under the low $V_{DD}$ of 0.5V, however, the energy-delay curve is sharper and the knee is much more apparent. The original logical effort model still provides the same estimation curve as in the 1.0V case, but the slope correction model (with a different K for 0.5V) tracks simulation very accurately. We see that a 70% reduction in internal energy can be achieved without sacrificing delay at 0.5V, but such a significant advantage of tapering is not modeled by the original logical effort model.

4.3 Limiting Factors in Sizing Effectiveness

In this chapter, we observed that sizing is an excellent optimization for reducing the internal energy of an equal-fan-out datapath. However, its effectiveness greatly diminishes after ~20% delay increment, and additional delay slack produces very little energy savings. In addition, most commercial designs have an upper limit on fan-out and transition time due to reliability concerns, which puts an additional boundary on tapering. If the upper bound on fan-out is 16, then the previous designs could not have internal energy of less than 0.2, which means tapering is only effective up to about 20% of delay slack.

Furthermore, tapering gate sizes only reduces the internal energy of the buffers, and not the total energy. For buffers driving a large load, reducing the internal energy quickly reaches diminishing returns. For the case in Figure 4-2b), even though 20% delay slack can reduce the internal energy to merely 15% of the reference case, the internal energy at that point is only about 3% of the total switching energy. To further reduce

energy for low power designs, it is essential to also reduce the energy in the load. To allow more energy reduction than sizing alone, and to reduce the total (and not only the internal) energy, we ought to incorporate supply voltage reduction in our optimizations. We see in the next chapter that reducing $V_{DD}$ can take advantage of a larger delay slack to allow more energy reduction, and reduces the total energy of the design as well.

CHAPTER V

Incorporating Supply Voltage Optimizations

5.1 Modeling Delay under Supply Voltage Scaling

In the previous chapter, we observed that sizing is only effective up to around 20% delay slack, and it only reduces the internal energy of the gates, but not the energy in charging and discharging the load capacitance. To address these issues, it is evident that supply voltage reduction is necessary for low-power designs. It does lead to exponential increase in energy as $V_{DD}$ approaches $V_T$, but this technique allows much more energy reduction than sizing alone.

To incorporate supply voltage optimization, it is necessary to accurately model gate delay as a function of $V_{DD}$. Recent short-channel technologies can be well modeled by the alpha-power model introduced in [15], where the drive-current of a transistor is modeled as

$$I_D = A\,(V_{DD} - V_T)^{\alpha}, \qquad (5.1)$$

where parameters A, $V_T$, and α are fitted for each technology. Given this formulation, we can then model the gate delay as

$$t_p = \frac{C\,V_{DD}}{I} \propto \frac{V_{DD}}{(V_{DD} - V_{TH})^{\alpha}}. \qquad (5.2)$$

Extracting the required parameters is not difficult: given an equal fan-out buffer chain, we can simply gather its delay as $V_{DD}$ is scaled down. Using the delay at 1V as reference, we can model the delay ratio as

$$\text{Delay Ratio}(V_{DD}) = \frac{V_{DD}}{(V_{DD} - V_T)^{\alpha}} \cdot \frac{(1 - V_T)^{\alpha}}{1}, \qquad (5.3)$$

and a least-squares fit should be able to extract parameters $V_T$ and α.

The above formulation should model delay accurately for the equal fan-out case, as long as the transistors are operating in strong inversion (moderate and weak inversion will be discussed later). However, for the tapered scenario, using a fixed K is insufficient to model all supply voltages, for we observed a lower K under lower $V_{DD}$ (Figure 3-3). Similar to the alpha-power model, we can model K as

$$K(V_{DD}) = A\,\frac{(V_{DD} - V_{TH})^{\beta}}{V_{DD}} + K_{ref}, \qquad (5.4)$$

where parameters A, β, and $K_{ref}$ are obtained by a least-squares curve fit of the extracted K in Figure 3-3. Since K is in the denominator of the slope correction model, $K(V_{DD})$ is essentially an inverse of (5.1) with a constant $K_{ref}$ added for improved model accuracy. Adding $K_{ref}$ also prevents K from reaching 0 as $V_{DD}$ scales down to $V_T$ (as in sub-threshold operation), for a K of 0 suggests an (unrealistic) infinite slope correction. For the inverter chain in a 90-nm technology, we obtained A = 1.3 and β = 1.4, together with a fitted $K_{ref}$. The modeled K function fits the extracted K values very well.
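The two voltage-dependent models of this section, the delay ratio (5.3) and K(V_DD) in (5.4), can be written as small Python functions. This is a sketch under assumptions: (5.4) is read as A·(V_DD − V_TH)^β / V_DD + K_ref, the values A = 1.3 and β = 1.4 come from the text, and the threshold voltage, α, and K_ref used below are hypothetical placeholders.

```python
def delay_ratio(vdd, vt, alpha, vref=1.0):
    """Delay at vdd relative to the delay at vref, per (5.3).
    Only valid in strong inversion (vdd comfortably above vt)."""
    return (vdd / (vdd - vt) ** alpha) / (vref / (vref - vt) ** alpha)


def k_of_vdd(vdd, vth, a, beta, k_ref):
    """Slope-correction factor as a function of supply voltage, per (5.4)."""
    return a * (vdd - vth) ** beta / vdd + k_ref


# A and beta are the fitted values quoted in the text;
# VT, ALPHA and K_REF are assumed values for illustration only.
VT, ALPHA = 0.35, 1.5
A, BETA, K_REF = 1.3, 1.4, 2.0

ratio_nominal = delay_ratio(1.0, VT, ALPHA)
ratio_low = delay_ratio(0.6, VT, ALPHA)
k_nominal = k_of_vdd(1.0, VT, A, BETA, K_REF)
k_low = k_of_vdd(0.6, VT, A, BETA, K_REF)
print(ratio_nominal, ratio_low, k_nominal, k_low)
```

The delay ratio is 1 at the reference voltage and grows as V_DD is lowered, while K shrinks toward K_ref, which matches the stronger input-slope effect reported at low supply voltage.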

5.2 Concurrent Optimization of Sizing and Supply Voltage

With models for both delay ratio and parameter K as a function of $V_{DD}$, we can revisit the 5-stage buffer chain example from the previous chapter, but this time optimizing for both sizing and supply voltage concurrently using the fmincon function in Matlab. The nominal voltage is 1.0V, and since supply voltage reduction is able to reduce the total (and not just internal) energy of the datapath, total energy and $V_{DD}$ are plotted against delay in Figure 5-1.

In the previous chapters we demonstrated sizing to be very effective during initial energy reduction of minimum-delay designs, and this still holds true here. We see from Figure 5-1 that $V_{DD}$ remains at 1.0V during the first few percent of delay increase, but given more delay slack, supply voltage reduction becomes the dominant optimization variable for the majority of low-power optimizations. This can also be observed visually, where the shape of the majority of the energy-delay curve seems to be a mere quadratic function of the $V_{DD}$-delay curve.

Supply voltage scaling, however, also comes at a cost in performance. We see in Figure 5-1 that a 63% reduction in total energy comes at a 100% increase in path delay, and the energy-delay curve is flattening out, suggesting that more delay penalty would apply under further $V_{DD}$ reduction. Nevertheless, it is interesting to observe the maximum potential energy savings achievable with supply voltage scaling. However, it is evident that such optimization is pushing $V_{DD}$ towards $V_T$, which causes the alpha-power model from (5.1) to approach 0 (and the delay to approach infinity). Given such inaccuracies, we must first establish an accurate current and delay model for the near- and sub-threshold region to be able to effectively optimize low-power circuits under such regions of operation.

Figure 5-1: Total energy (a) and $V_{DD}$ (b) vs. delay for an inverter chain in 90-nm technology. (Axes: (a) total energy (normalized) vs. delay increment (%), with curves for simulation, original LE, and LE with slope correction; (b) $V_{DD}$ vs. delay increment (%).)

5.3 Sub-threshold and IC Model

As supply voltage reaches threshold voltage, it is generally acceptable to model both on- and off-currents of a transistor using the sub-threshold leakage equation:

$$I_{ON} = I_{Leakage}\; e^{V_{DD}/(n \Phi_t)}, \qquad (5.5)$$

$$I_{Leakage} = I_S\; e^{(\sigma V_{DD} - V_T)/(n \Phi_t)}, \text{ and} \qquad (5.6)$$

$$I_S = 2\,n\,\mu\,C_{ox}\,\frac{W}{L}\,\Phi_t^2, \qquad (5.7)$$

where n is the sub-threshold slope factor, σ is the DIBL factor, and $\Phi_t$ is the thermal voltage given by kT/q, or 26mV at room temperature. Mobility μ and oxide capacitance $C_{ox}$ are the same as those from traditional I-V equations. Such a model is able to model sub-threshold current quite accurately; however, we will see that this model is not suitable for optimizing $V_{DD}$ for low-power designs.

As shown in Figure 5-2, the α-power model is unable to model current as $V_{DD}$ approaches $V_T$, and the leakage model becomes inaccurate once $V_{DD}$ rises above $V_T$. In the moderate inversion regime, where $V_{DD}$ is close to $V_T$, neither model is able to model the on-current very accurately. This issue is non-trivial, because we will see that the moderate-inversion regime is very attractive for low-power designs.

Another issue that arises with combining the α-power and leakage models is that the $I_D(V_{DD})$ function is not continuous at the transition point. Although we can modify the fitting parameters to make the two equations equal at the transition point, this comes at a

cost of modeling accuracy for the rest of the $V_{DD}$ regimes. Such forced fitting of the parameters still does not guarantee that the gradients of the two functions are continuous at the transition point, which may cause difficulties during optimization. Even if the two gradients cannot be equal at the transition point, the gradient of the α-power model should be steeper than that of the leakage model at the transition to preserve the convexity of the $I_D(V_{DD})$ function.

Figure 5-2: $V_{DD}$ vs. $I_D$ for the α-power and leakage models against simulation. (Axes: $I_D$ (nA) vs. $V_{DD}$ (V), with $V_{th}$ marked; curves: simulation, α-power model, leakage model.)

Fortunately, extensive research has been conducted in this area, and an IC/EKV model has been developed in [16] that is accurate for all regions of transistor operation. IC represents the inversion coefficient, which is around 1 for $V_{DD} = V_T$ (moderate inversion), much less than 1 for sub-threshold operation (weak inversion), and reaches

around 100 for strong inversion. The on-current of a transistor can be modeled as:

$$I_{ON} = IC \cdot I_S \cdot k_{fit}, \qquad (5.8)$$

$$IC = \left[\ln\!\left(e^{\frac{(1+\sigma) V_{DD} - V_T}{2 n \Phi_t}} + 1\right)\right]^2, \qquad (5.9)$$

where $k_{fit}$ is a fitting factor, and the remaining parameters are the same as those in (5.6).

Figure 5-3: $V_{DD}$ vs. $I_D$ for the α-power, leakage, and IC models against simulation. (Axes: $I_D$ (nA) vs. $V_{DD}$ (V), with $V_{th}$ marked.)

Even though the IC model is not as intuitive as the α-power and sub-threshold models, and is generally not used for hand calculations, it is attractive for optimization because it accurately models the on-current under all regions of $V_{DD}$ (Figure 5-3).
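The behavior of the IC model across operating regions can be explored with the short Python sketch below. It assumes (5.9) reads IC = [ln(exp(((1+σ)·V_DD − V_T)/(2nΦ_t)) + 1)]² and that (5.8) is the product IC·I_S·k_fit; the parameter values are hypothetical and are not the fitted values from the thesis.

```python
import math

PHI_T = 0.026  # thermal voltage at room temperature (V), as quoted in the text


def inversion_coefficient(vdd, vt, n, sigma, phi_t=PHI_T):
    """Inversion coefficient per (5.9)."""
    x = ((1.0 + sigma) * vdd - vt) / (2.0 * n * phi_t)
    return math.log1p(math.exp(x)) ** 2


def on_current(vdd, vt, n, sigma, i_s, k_fit, phi_t=PHI_T):
    """On-current per (5.8)."""
    return k_fit * i_s * inversion_coefficient(vdd, vt, n, sigma, phi_t)


# Hypothetical device parameters for illustration only.
VT, N_SLOPE, SIGMA = 0.5, 1.5, 0.1

# Deep in weak inversion, ln(1 + e^x) ~ e^x, so IC ~ e^(2x): the same
# exponential dependence as the sub-threshold equations (5.5)-(5.6).
vdd_weak = 0.1
x_weak = ((1.0 + SIGMA) * vdd_weak - VT) / (2.0 * N_SLOPE * PHI_T)
ic_weak = inversion_coefficient(vdd_weak, VT, N_SLOPE, SIGMA)
ic_weak_exponential = math.exp(2.0 * x_weak)

ic_strong = inversion_coefficient(1.2, VT, N_SLOPE, SIGMA)
print(ic_weak, ic_weak_exponential, ic_strong)
```

A single smooth expression therefore covers both regimes, which is what makes this form convenient for gradient-based optimization over V_DD.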

Figure 5-4: $V_{DD}$ vs. $I_{on}$ and $I_{off}$ for the IC model against simulation. (Axes: $I_{on}$ and $I_{leakage}$ (nA), and IC, vs. $V_{DD}$; curves: simulation, IC+leakage model. Annotations: MSE of the IC fit = 0.18%, MSE of the leakage fit = 0.21%, IC = 1 at 0.372V; fitting parameters $I_S$, σ, $V_T$, n, $k_{fit}$.)

The off-current is still modeled by the leakage model from (5.6), and given the same set of parameters, the fitting accuracy for both the on- and off-current is within 0.2% mean-squared error. The simulated and fitted plots for on- and off-current, along with

IC, are shown in Figure 5-4. The same set of fitting parameters is used for both on- and off-currents. We see the point where IC=1 is slightly lower than $V_T$. However, it is generally desirable to have IC=1 correspond to $V_{DD} = V_T$, which draws a clear boundary between the strong and weak inversion regimes. When $V_{DD} = V_T$, the following equation holds true:

$$IC = \left[\ln\!\left(e^{\frac{(1+\sigma) V_T - V_T}{2 n \Phi_t}} + 1\right)\right]^2. \qquad (5.10)$$

By setting IC = 1 in this equation, we can set parameter n to be:

$$n = \frac{\sigma\,V_T}{2\,\Phi_t\,\ln(e - 1)}. \qquad (5.11)$$

By removing one fitting variable, the mean-squared error of the fit has increased from 0.2% to 0.4%, but this is more than sufficient for our optimization purposes.

From the IC model, it is evident that $V_T$ plays an important role in supply-voltage optimization, for it is generally the difference between $V_{DD}$ and $V_T$ that determines the delay ratio, and $V_T$ also plays a critical role in determining the leakage current. It is therefore desirable to optimize $V_T$ concurrently with $V_{DD}$; however, such an approach is generally not feasible in modern-day CMOS processes, as $V_T$ is generally fixed for the given technology. Most processes do offer at least two different threshold voltages for the same technology, so it is interesting to compare the on-/off-current and fitting differences between high-$V_T$ (HVT) and low-$V_T$ (LVT) cells. We aimed to have a single set of fitting parameters for both types of transistors, which has increased the mean-squared error from 0.5 to 1.5%, but is still very reasonable. The on- and off-currents, along with IC, are plotted against $V_{DD}$ in the following figure.
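The choice of n in (5.11) can be verified with a few lines of Python: substituting it into the IC expression of (5.9) should give IC = 1 exactly at V_DD = V_T. The threshold voltage and DIBL factor below are hypothetical values used only for this check.

```python
import math

PHI_T = 0.026  # thermal voltage at room temperature (V), as quoted in the text


def slope_factor_for_ic1(vt, sigma, phi_t=PHI_T):
    """Sub-threshold slope factor n from (5.11), chosen so that IC = 1
    when V_DD = V_T."""
    return sigma * vt / (2.0 * phi_t * math.log(math.e - 1.0))


def inversion_coefficient(vdd, vt, n, sigma, phi_t=PHI_T):
    """Inversion coefficient per (5.9)."""
    x = ((1.0 + sigma) * vdd - vt) / (2.0 * n * phi_t)
    return math.log1p(math.exp(x)) ** 2


# Hypothetical threshold voltage and DIBL factor.
VT, SIGMA = 0.4, 0.1

n_fit = slope_factor_for_ic1(VT, SIGMA)
ic_at_vt = inversion_coefficient(VT, VT, n_fit, SIGMA)
print(n_fit, ic_at_vt)
```

Tying n to σ and V_T in this way removes one free fitting variable, which is the trade-off against fit error described in the text.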

Figure 5-5: $V_{DD}$ vs. $I_{on}$ and $I_{off}$ for the IC model for LVT and HVT cells. (Axes: current (nA) and IC vs. $V_{DD}$ (V), with $V_{T,LVT}$ and $V_{T,HVT}$ marked; curves: simulation and model for $I_{on}$ and $I_{leakage}$, HVT and LVT; fitting parameters $I_S$, σ, $V_{TL}$, $V_{TH}$, n, $k_{fit}$.)

5.4 Modeling Energy under Supply Voltage Scaling

In the previous chapters, sizing was the main optimization variable, and energy was simply modeled to be linearly proportional to sizing. This assumption holds true for sizing optimizations alone, for both switching ($C_L = W \cdot C_{ox}$) and leakage ($I_S = W \cdot I_{S0}$) energy are linearly proportional to sizing. Under $V_{DD}$ scaling, however, the relationship between energy and $V_{DD}$ is much more complex. Though switching energy is quadratically proportional to $V_{DD}$, this is not true for leakage energy. In addition, leakage energy-per-operation is dependent on the speed of the operation, for a slower design also spends more time leaking. Under sizing optimization alone, a fast design with less than 20% increase in delay does not affect leakage by a significant amount, but under aggressive $V_{DD}$ scaling, the exponential increase in delay eventually leads to an exponential increase in leakage energy.

The energy-per-operation of a datapath can be modeled as the sum of switching and leakage energy,

$$E_{OP} = e_{sw} + e_{lk} = C_L\,V_{DD}^2 + I_{leakage}\,V_{DD}\,\frac{D}{\alpha}. \qquad (5.12)$$

The formulation for $e_{lk}$ is quite interesting, for it is a product of leakage power ($I_{leakage} \cdot V_{DD}$) with the time duration of leakage, D/α. Parameter α here is the activity factor (not to be confused with the α-power model), which defines the number of active (switching) clock cycles per total clock cycle. For most datapaths, α varies from 0.1% to 10%. Parameter D is the clock period, which is determined by the critical-path delay of the datapath. Given the formulation in (5.12), the energy-per-operation of a datapath with

1% activity factor is the sum of its switching energy and its leakage energy over 100 clock cycles. Evidently, low-activity datapaths are energy-inefficient, for they must idle for longer periods before a useful operation is performed. Expanding (5.12) and isolating gate sizes from V_DD, we can express the energy-per-operation as

$$ EOP = \sum_{i=1}^{n} W_i \left[ C_{ox} V_{DD}^2 + 2 n \mu C_{ox} \frac{W_0}{L} \phi_t^2 \, e^{\frac{\sigma V_{DD} - V_T}{n \phi_t}} V_{DD} \frac{D}{\alpha} \right] \qquad (5.13) $$

5.5 Optimization with Aggressive Supply Voltage Scaling

With accurate delay and energy models for all regions of transistor operation, we can now explore the energy-delay optimization space for very low-power designs, where V_DD is aggressively scaled down to V_T or below. As stated in previous sections, leakage energy per operation increases under aggressive V_DD scaling, because the exponential increase in delay causes parameter D in (5.13) to increase. The minimum-energy point (MEP) is reached when the increase in leakage energy under further V_DD reduction equals the additional reduction in switching energy. The exact location of the MEP depends on many factors, but the threshold voltage and the activity factor of the design play a dominant role. Figure 5-6 shows two designs, with different activity factors and threshold voltages, near their respective points of minimum energy.
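The balance between switching and leakage energy that defines the MEP can be illustrated with a small sweep. This is a toy Python sketch under loudly stated assumptions, not the thesis's model: currents and capacitance are in normalized units, the clock period is assumed to be D = depth·C·V_DD/I_on, the on- and off-currents use an assumed IC-style form, and SIGMA, N, DEPTH, and the 0.1 V sweep floor are hypothetical values. It is meant to show the trend with activity factor, not to reproduce the numbers in Figure 5-6.

```python
import math

PHI_T = 0.0259   # thermal voltage (V), roughly 300 K
V_T = 0.376      # threshold voltage (V); LVT value quoted in Section 5.5
SIGMA = 0.1      # DIBL-like fitting factor (hypothetical)
N = 1.34         # sub-threshold slope factor (hypothetical)
DEPTH = 20       # logic depth of the critical path (hypothetical)

def i_on(v_dd):
    """Normalized on-current from an assumed IC-style model."""
    x = ((1.0 + SIGMA) * v_dd - V_T) / (2.0 * N * PHI_T)
    return math.log1p(math.exp(x)) ** 2

def i_leak(v_dd):
    """Normalized off-current (V_GS = 0), assumed exponential in V_DD."""
    return math.exp((SIGMA * v_dd - V_T) / (N * PHI_T))

def energy_per_op(v_dd, alpha, c_load=1.0):
    """EOP = C*V^2 + I_leak*V*D/alpha, following the structure of (5.12)."""
    delay = DEPTH * c_load * v_dd / i_on(v_dd)   # assumed clock-period model
    e_sw = c_load * v_dd ** 2
    e_lk = i_leak(v_dd) * v_dd * delay / alpha
    return e_sw + e_lk

def minimum_energy_vdd(alpha, lo=0.10, hi=1.00, steps=901):
    """Grid search for the V_DD that minimizes EOP on [lo, hi]."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda v: energy_per_op(v, alpha))

mep_low_activity = minimum_energy_vdd(alpha=0.01)
mep_high_activity = minimum_energy_vdd(alpha=0.10)
```

In this toy model the lower activity factor moves the minimum-energy supply voltage upward, because each useful operation is charged with more idle leakage time. The sweep is bounded below because the simplified expressions do not capture loss of functionality at very low V_DD.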

Figure 5-6: Total, switching, and leakage energy vs. delay for (a) LVT with α=1% and (b) HVT with α=10% designs. (Axes: energy normalized to MEP vs. delay normalized to MDP; curves E_total, E_sw, and E_lk, with the MEP marked.)

As expected, the LVT design in Figure 5-6(a) suffers from more leakage due to its lower threshold voltage and lower activity factor. As a result, the MEP occurs at a V_DD of around 0.355 V, slightly below its V_T of 0.376 V, and the total energy is 13.9× lower than that at the minimum-delay point (MDP). In comparison, the design in (b) is much more immune to leakage due to its higher activity factor and higher V_T. As a result, its MEP occurs at a V_DD of around 0.236 V, much lower than its V_T of 0.519 V, and the total energy is 31.2× lower than that at the MDP.

The large energy savings from MEP operation seem very attractive, but they come at a large penalty in performance. For the design in Figure 5-6(b), the MEP carries a severe cost in performance compared to the minimum-delay point. Fortunately, near-MEP operation can often gain a large increase in performance at little cost in energy (in contrast to minimum-delay optimizations). It is therefore interesting to observe how effectively different optimization parameters reduce delay near the MEP. Unlike energy-delay optimizations near the MDP, where we are interested in the variable that provides the most energy reduction per unit cost in delay, we are now interested in the variable that provides the smallest energy increment per unit reduction in delay.

Figures 5-7 and 5-8 demonstrate the effectiveness of sizing and V_DD (jointly and individually) when optimizing an LVT design starting from the MEP. It is apparent that sizing is not effective near the MEP, for its cost in energy is much larger than that of V_DD scaling. As a result, the energy-delay optimization is driven virtually by V_DD optimization alone, until near the MDP, where sizing begins to take effect once V_DD has reached its upper bound. It is also apparent that a lower activity factor is detrimental with respect to energy-per-operation.

Figure 5-7: Energy vs. delay for sizing and V_DD optimizations with α = 1% and 10%. (Axes: energy normalized to the MEP for α = 10% vs. delay normalized to MDP; curves opt(V_DD, W_i), opt(W_i) only, and opt(V_DD) only for each α.)

Figure 5-8: V_DD vs. delay for the V_DD optimization above. (Axes: V_DD in V vs. delay normalized to MDP, for α = 10% and α = 1%, with V_T marked.)

To take a closer look at the effectiveness of sizing, V_DD, and (when possible) V_T in the energy-delay space, let us examine the energy-delay sensitivities of the individual variables, shown in Figures 5-9 and 5-10.

Figure 5-9: Energy-delay sensitivity (left axis) and energy-delay tradeoff (right axis). (Curves S(W_i), S(V_DD), and S(V_T), and energy normalized to MEP, vs. delay normalized to MDP.)

Figure 5-10: Energy-delay sensitivity near MDP (left) and MEP (right). (Curves S(W_i), S(V_DD), and S(V_T) vs. delay normalized to MDP.)

Figure 5-9 shows the simulated energy-delay sensitivity for an adder as well as the optimal E-D tradeoff. Figure 5-10 shows a detailed zoom of the areas around the MDP (left) and the MEP (right) to compare techniques for high-performance and low-power design optimization. On the optimal E-D curve, the sensitivities of the active parameters need to be equal. For high-performance optimizations, we aim for the variable with the highest sensitivity (the largest energy reduction for a given delay increment), so when the sensitivity curve of a parameter deviates from the highest curve, that parameter has reached its constraint limit and is no longer active in supporting further energy reduction. For low-power optimizations, we aim for the variable with the lowest sensitivity (the smallest energy penalty for a given delay reduction), so when the sensitivity curve of a parameter deviates from the lowest curve, that parameter has reached its constraint limit and is no longer active in supporting further delay reduction. This is the case with V_T and sizing at the MEP, and with V_T and V_DD at the MDP.

As expected, near the MEP, V_DD tuning has the lowest sensitivity (the least increase in energy for a given delay reduction) and is thus the most effective parameter for delay reduction. As we traverse up the E-D curve, V_T tuning also becomes a significant parameter, and sizing becomes significant only in high-V_DD, low-V_T scenarios (Figure 5-10, left), where high-performance designs are required.

From the sensitivity analysis in Figures 5-9 and 5-10, it is interesting to see that V_T is active throughout most of the energy-delay space. It is more effective than V_DD scaling in the high-performance regime, and more effective than sizing in the low-power regime. Although V_T is not easily varied at the device level, recent research such as [18] is able to vary V_T using novel circuit topologies. Nevertheless, it is interesting to compare the energy-delay space achievable when V_T is incorporated as an optimization parameter. The optimization results are shown in Figures 5-11 and 5-12.

Figure 5-11: Energy vs. delay for adder optimization through V_T adjustment. (Axes: energy in fJ vs. delay in ns; curves Var-VT, LVT, and HVT for α = 0.1% and α = 10%.)

Figure 5-12: V_DD vs. delay for adder optimization with V_T as an optimization variable. (Axes: voltage in V vs. delay in ns; V_DD and V_T curves for α = 0.1% and α = 10%.)

Given the freedom to vary V_T, we can achieve equal or higher energy-delay efficiency than any fixed-V_T cells across all regions of the energy-delay space, and for all activity factors. For the lower activity factor of 0.1%, varying V_T achieves energy efficiency similar to that of the HVT cells in very low-power regimes while preserving the advantage of LVT cells in other regimes. For the higher activity factor of 10%, varying V_T gives a clear advantage in the energy-delay space.

It is evident that designs with low activity factors are less energy-efficient. This is because such designs are more affected by leakage, which increases exponentially under supply-voltage reduction. To obtain higher energy efficiency, it is beneficial to perform architectural optimizations that minimize the cost of leakage. For high-performance designs, however, energy is dominated by switching energy, so the activity factor plays a less significant role. Since architectural changes are separate from circuit optimizations, they are discussed in the appendix.

From Figure 5-11, it is evident that although HVT cells can achieve lower energy-per-operation than LVT cells, they carry a performance penalty compared to LVT cells. Such a high performance penalty for a marginal energy reduction is highly undesirable in low-power design. For performance-constrained low-power designs, it is generally more effective to use LVT cells and operate at a lower V_DD than to use HVT cells, which would require a higher V_DD to meet the same performance.

CHAPTER VI

Optimization for Synthesized Design

6.1 Low Power Optimization for Synthesis Flow - Issues

Having discussed the theoretical aspects of energy-delay optimization with respect to different optimization parameters, it is beneficial to apply the discussed techniques to more realistic designs. Design optimization of individual buffer chains and datapaths, which served as examples in previous chapters, is seldom used in real-life design practice. Although these examples were good demonstrations that provide design insight, it is nevertheless necessary to apply such techniques to logic synthesis, which is the design approach for the vast majority of modern logic designs.

Logic synthesis and place-and-route have enabled the design and implementation of very high-complexity circuits in a timeframe that would not have been possible for full-custom designs. In the silicon industry, where time-to-market and design cost (in man-hours) are critical to the success of a product, this automated design flow has become an integral part of tape-out for most companies. Unlike research designs, where state-of-the-art performance and power numbers appear to be the foremost design criteria, the major criteria for commercial designs are long-term functionality and reliability, along with high manufacturing yield. Performance and power are optimized only as long as these major criteria are not violated. As transistor scaling continues, the pressing issue of reliability and yield has led to greater manufacturing difficulties in the fabrication process and ever more stringent design rules for the physical-design flow, and it is affecting logic design as well.

The previous chapters have established that increasing fan-out towards the latter stages is a common result of tapering for low-power designs. However, reliability issues with electro-migration and hot-carrier effects have placed a strict limit on transition time [19]. Since transition time is directly correlated with fan-out, the maximum fan-out of most designs is limited as well. This constraint has some effect on the limits of tapering, but as shown in Chapters IV and V, sizing does not contribute significant energy reduction beyond a 20-30% delay increment, so allowing excess fan-out (at the cost of reliability) does not yield much additional energy savings.

We saw in the previous chapter that the dominant optimization parameter for low-power design is V_DD scaling. In modern CMOS, however, low-V_DD operation near the minimum-energy point is still very rare in commercial designs, mostly due to the increased penalty of process variability under low V_DD. As V_DD scales down towards V_T, we observe a near-exponential increase in delay, because transistor current in the near- and sub-threshold regions is exponentially dependent on (V_DD - V_T). However, due to the nature of the manufacturing process, the silicon body doping cannot be precisely controlled [20], so it is common to observe threshold-voltage variations even within the same wafer. Such variation in V_T may cause timing differences of more than 30% under a nominal V_DD of 1-1.2 V, but could easily lead to timing failures when the operational V_DD is only 100 mV above V_T. Such V_T variation also causes 10× variations in leakage, but for most commercial designs, the penalty of a timing failure is much more severe than the penalty of a less power-efficient chip.

To counter the increased effect of process variations on low-V_DD designs, some have used back-gating (body-biasing) to fine-tune V_T after fabrication [21, 22], but the implementation overhead in fabrication (triple-well process) and chip characterization is non-trivial. In other cases, some state-of-the-art synthesis libraries provide timing information (.lib files) not only for nominal voltages, but for an entire range of V_DD across all process corners (mainly the typical-NMOS-typical-PMOS (TT), slow-NMOS-slow-PMOS (SS), and fast-NMOS-fast-PMOS (FF) cases; some also provide slow-NMOS-fast-PMOS (SF) and fast-NMOS-slow-PMOS (FS) information). However, characterizing such timing information for all standard cells within a design kit is a non-trivial task, so most synthesis libraries that we have encountered include .lib timing information only for the nominal V_DD (1-1.2 V). As a result, it is essential to properly characterize the low-V_DD performance variations of a given process technology.

6.2 Characterizing Low-V_DD Performance Variations

From the previous chapter, we see that near-MEP operation generally does not occur below 0.3 V for our 65-nm technology. As a result, we need to characterize the timing differences between the TT, SS, and FF corners for V_DD down to 0.3 V. The characterization is first performed on a fan-out-of-4 inverter, similar to the cases in Chapter III. As expected, the timing differences between the FF, TT, and SS corners are 30% at a V_DD of 1.0 V, but increase to 3× as V_DD scales down to 0.3 V.

Figure 6-1: Inverter delay (t_p) vs. V_DD, normalized to the delay at the 1.0V-TT corner. (Curves for the TT, SS, and FF corners; the corner spread is annotated as 30% at high V_DD and 3× at low V_DD.)

Due to such a large discrepancy in delay under low V_DD, it is essential that the delay-V_DD relationship of the SS corner is used for calculating synthesis timing constraints. For example, if the desired operating frequency is 1 MHz at a V_DD of 0.4 V, and Figure 6-1 shows a 100× delay increase in the TT corner from 1.0 V to 0.4 V but a 250× delay increase in the SS corner, we must synthesize for 250 MHz using the SS timing library for 1.0 V.

However, characterizing delay vs. V_DD for an inverter may not be sufficient: as [14] has suggested, complex gates with stacked PMOS and NMOS transistors may behave differently from inverters in sub-threshold operation. One of the most sensitive stacked-logic cells is the D-flip-flop, so we have also characterized the relationship between its clock-to-q delay and V_DD. Fortunately, the delay of the flip-flop scales very similarly to that of an inverter, as shown in Figure 6-2.
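The timing-constraint arithmetic in the 1 MHz example can be written out explicitly. This is a small Python sketch for illustration only; the function name is hypothetical, and the slowdown factors are the ones quoted in the example above.

```python
def synthesis_target_hz(target_hz, slowdown_at_low_vdd):
    """Frequency to synthesize for at nominal V_DD so that the design still
    meets target_hz after slowing down by the given factor at low V_DD."""
    return target_hz * slowdown_at_low_vdd

# Desired operation: 1 MHz at V_DD = 0.4 V.
tt_target = synthesis_target_hz(1e6, 100)   # 100 MHz if only TT were considered
ss_target = synthesis_target_hz(1e6, 250)   # 250 MHz when the SS corner is used
```

Using the TT factor alone would leave the design 2.5× too slow in the SS corner, which is why the SS delay-V_DD relationship sets the constraint.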

Figure 6-2: Inverter delay and clock-to-q delay vs. V_DD, normalized to the inverter and clock-to-q delays at the 1.0V-TT corner. (Curves for the inverter and the flip-flop in the SS, TT, and FF corners.)

However, having a design that meets timing in the SS corner may not be sufficient. In modern standard-cell libraries, the PMOS is generally sized to have a smaller drive current than the NMOS. This creates shorter average rise/fall delays than upsizing the PMOS to have equal drive strength. Under very low V_DD, however, the weaker drive strength of the PMOS may result in insufficient on-current to overpower the leakage current of its NMOS counterpart. As a result, complex gates with stacked PMOS transistors are the most likely to fail, especially in the FS (fast-NMOS-slow-PMOS) corner [14]. The following two transient responses demonstrate this scenario.

Figure 6-3 shows the SS-corner clock-to-q delay of a D-flip-flop at a V_DD of 150 mV. We see that the clock-to-q delays are both in the microsecond range, with an average clock-to-q delay of 6.93 µs. Such a long delay is not suitable for most applications, and a supply voltage of 150 mV is well below the optimal V_DD for minimum energy (given this 65-nm process). Realistically, this scenario will not occur in a well-optimized design, but the flip-flop is nevertheless fully functional in the 150 mV SS corner.

Figure 6-3: The clock-to-q transition of a D-flip-flop in the 150 mV SS corner. (Clk-to-q fall and clk-to-q rise transitions are marked.)

Figure 6-4: The clock-to-q failure of a D-flip-flop in the 165 mV FS corner. (The failing transition is marked.)

Figure 6-4 shows the FS corner of the same flip-flop at a V_DD of 165 mV. Even though the supply voltage has increased by only 15 mV, we already observe much sharper clock transitions compared to the 150 mV case, and the NMOS transitions are much faster than the PMOS transitions, as expected. However, the flip-flop is unable to function properly in this scenario. We see the data transition pulled up properly after the second positive clock edge, but after the clock (and input data, not shown) turns low, the flip-flop is unable to hold its value until the next positive clock edge and produces an early data transition. Fortunately, this scenario does not occur above 0.2 V for this 65-nm technology, which is lower than the minimum-energy V_DD of 0.3 V; therefore, using the SS model for worst-case timing characterization is sufficient.

6.3 Standard-Cell Designs and the Slope Correction Model

It was shown previously that the slope correction model is able to optimize the sizing of full-custom designs, but a few changes are needed to use this model for standard-cell logic. Most standard-cell logic is produced by a synthesis tool, so it is generally in a netlist format, where the instantiation of each cell and its input/output connections are specified in a text file. For the Matlab environment to understand the gate instantiations and connections, some netlist reformatting is required (Figure 6-5).
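The netlist reformatting can also be scripted. The sketch below is a Python alternative to the search-and-replace and spreadsheet route described in the thesis, not the author's method. It assumes cell names of the form TYPE[inputs]X[size], maps INV to IV with one input as in Figure 6-5, and emits tab-separated fields. The function name and regular expressions are hypothetical, and the figure's column convention for multi-digit cells such as AOI21 is not reproduced.

```python
import re

# Cell name such as INVX8 or NOR2X4: type letters, optional input count, size.
CELL_RE = re.compile(r"^([A-Z]+?)(\d*)X(\d+)$")
# One instantiation per line: CELL instance_name ( pin list ) ;
INST_RE = re.compile(r"^\s*(\w+)\s+(\w+)\s*\((.*)\)\s*;\s*$")
# Named pin connection such as .a (n_294); captures the net name.
PIN_RE = re.compile(r"\.\w+\s*\(\s*([^)\s]+)\s*\)")

def reformat_instance(line):
    """Turn one netlist instantiation into a tab-separated row:
    type, input count, size, instance name, then nets in pin order."""
    inst = INST_RE.match(line)
    if inst is None:
        raise ValueError("not a cell instantiation: " + line)
    cell, name, pins = inst.groups()
    parts = CELL_RE.match(cell)
    if parts is None:
        raise ValueError("unrecognized cell name: " + cell)
    gate_type, n_inputs, size = parts.groups()
    if gate_type == "INV":
        gate_type = "IV"            # naming used in the reformatted table
    nets = PIN_RE.findall(pins)
    return "\t".join([gate_type, n_inputs or "1", size, name] + nets)

rows = [
    reformat_instance("INVX8 g2653(.a (n_294),.y (Z[15]));"),
    reformat_instance("NOR2X4 g2654(.a (n_282),.b (n_277),.y (n_294));"),
]
```

Because the fields are tab-separated, each one lands in its own column when the output is opened in a spreadsheet, which matches the column layout the import step relies on.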

    INVX8 g2653(.a (n_294),.y (Z[15]));
    INVX8 g2657(.a (n_291),.y (Z[13]));
    INVX6 g2655(.a (n_292),.y (Z[14]));
    NOR2X4 g2654(.a (n_282),.b (n_277),.y (n_294));
    INVX8 g2665(.a (n_289),.y (Z[11]));
    NOR2X4 g2656(.a (n_287),.b (n_280),.y (n_292));
    NOR2X4 g2658(.a (n_283),.b (n_281),.y (n_291));
    INVX12 g2679(.a (n_286),.y (Z[9]));
    NOR2X4 g2666(.a (n_278),.b (n_267),.y (n_289));
    AOI21X1 g2833(.a0 (n_49),.a1 (n_50),.b0 (n_65),.y (n_124));
    ...

    IV    1   8    g2653   n_294   Z[15]
    IV    1   8    g2657   n_291   Z[13]
    IV    1   6    g2655   n_292   Z[14]
    NOR   2   4    g2654   n_282   n_277   n_294
    IV    1   8    g2665   n_289   Z[11]
    NOR   2   4    g2656   n_287   n_280   n_292
    NOR   2   4    g2658   n_283   n_281   n_291
    IV    1   12   g2679   n_286   Z[9]
    NOR   2   4    g2666   n_278   n_267   n_289
    AOI   2   1    g2833   n_49    n_50    n_65    n_124

Figure 6-5: The original netlist in text and the reformatted netlist in Excel.

The process of reformatting may seem tedious, but it is quite straightforward using search-and-replace functions. Each tab in the text is automatically translated into a new column in Excel, so the important parameters in the text file (gate type, size, name, input/output wires, etc.) can be assigned to the correct columns by inserting tabs between them. The data from the Excel spreadsheet (text and numbers) can then be imported into Matlab with the xlsread function.

Instead of specifying the transistor width of each logic gate, the standard-cell library simply uses a number to represent the gate size. Depending on the type of the gate and the technology library, the transistor width for a specific gate size may change (e.g., the transistor sizes for INV2 in 90-nm technology differ from those for INV2 in 65-nm technology, and the transistor sizes for NAND2 and INV2 differ even though both have a drive strength of 2). Since the choice of gate sizes is limited in the standard-cell library, delay and energy become discrete functions of gate size, similar to step functions. The Matlab function fmincon (used in Chapters IV and V) is ineffective for step functions because they are discontinuous and contain many false local minima (the gradient remains 0 between two gate sizes). We therefore need to use the function fminsearch for this optimization; unlike fmincon, it is a simplex search method that does not use numerical or analytic gradients, so it applies to non-continuous functions as well [23].

Before we use the slope correction model to optimize standard-cell designs, it is interesting to first evaluate the accuracy of the timing library files against our logical effort models and simulation results. In the following sections, we first compare synthesis-tool estimations and logical effort models for two synthesized adders. We then optimize the adders using the slope-correction model and compare the accuracy of these estimations at various energy-delay points.

6.4 Comparison of Estimation Accuracy

To perform a controlled comparison between the models and to isolate the timing errors due to tapering, two versions of a 16-bit parallel adder were synthesized using a 65-nm standard-cell library; both have about 16 logic stages and more than 300 gates. The first version drives only a 2 fF load at each output, which equals its input capacitance. This acts as a reference case for the accuracy of the original logical effort model because there is very little tapering. There is actually negative tapering involved due to such a small output capacitance, hence the original logical effort model estimates slightly shorter delays than the slope-correction model. The estimation errors are shown in Table 6-1.

TABLE 6-1. DELAY COMPARISONS FOR TWO SYNTHESIZED ADDERS

                     Delay Estimation Error (%)
    Adder Design     Original LE    Slope-Corr. LE    Synthesis Library
    2 fF load        4.3%           4.7%              11.2%
    512 fF load      41.6%          6.3%              9.7%

The second version is synthesized for minimum delay with a 512 fF load at each output. In this case, the original logical effort model greatly over-estimates the delay, especially in non-critical paths where the fan-out is 30 or more. Yet the slope-corrected model shows only 1.6% more error than in the 2 fF case. The synthesis estimation is conservative by about 10% in both cases, which is a good margin to reserve for the place-and-route flow. Knowing that the slope correction model maintains its accuracy for synthesized designs with multi-fan-out datapaths, we continue to perform sizing optimizations using this model.

6.5 Sizing Synthesized Design using the Slope Correction Model

To further demonstrate the slope-correction model on synthesized logic, we extended the optimization tool in Matlab to accommodate standard-cell-based designs, as shown in Fig. 6-6. The tool first reads in the synthesized netlist and calculates the current critical-path delay using the slope-correction model. Based on the circuit topology and standard-cell library, the minimum achievable delay is determined (similar to Chapter IV). This estimation may not be fully accurate, since it is obtained by minimizing each individual path delay while assuming all branching fan-outs to be fixed. However, the estimation error is usually less than 1%. The user may then specify how much delay slack to allow; we have allowed 0-30% of slack to gather enough energy-delay information.

    Import netlist & create gate connections
    Determine minimum achievable delay
    Enter delay slack
    Initial critical path meets timing?
        No:  Resize path to meet delay
        Yes: Minimize energy of critical-path gates
    Minimize energy of non-critical-path gates
    Critical path meets timing?
        No:  Resize path to meet delay
        Yes: exit

Figure 6-6: Flow diagram of the optimization tool.

With the specified timing constraint, the Matlab tool locates the timing-critical paths and corrects the paths that violate timing; this process usually results in increased energy. After timing is met, gates in the critical and non-critical paths are optimized to minimize energy, using methods similar to those described in Chapter IV. Since the non-critical paths have more timing slack, the energy optimization tends to use smaller gates, which leads to a larger fan-out toward the output of the path. Without fan-out restrictions, it is possible to reach a fan-out of up to 80 with this tool, where the delay estimation may have an error of up to 15%. However, such high-fan-out errors only occur on non-critical paths, so the accuracy of the critical-path delay estimation is unaffected. If such a large fan-out is not desired, the user may specify a maximum fan-out limit.

6.6 Comparison of Sizing Optimization Results

Using the design synthesized for minimum delay (Fig. 6-7, A) as a starting point, discrete sizing optimization is performed using the slope-correction model to obtain the new minimum delay (Fig. 6-7, B). From this point on, the adder was optimized by gate sizing using the slope-correction model. The optimization is done for 5 delay targets with delay slack of up to 30%; each design corresponds to a data point on the energy vs. delay curves in Fig. 6-7. The delay estimated by the slope-correction model is compared against Spectre simulations, synthesis timing, and the original logical effort model. The internal power is obtained by Spectre simulation. Due to the limited sizing choices in the standard-cell library, the energy vs. delay curve is not as smooth as those for buffer chains. However, we can still achieve more than 20% internal power reduction over the synthesized adder while maintaining the same delay. The average delay error of the slope-correction model is less than 5% compared to simulation, which is consistent with the reference error in Table 6-1.

Figure 6-7: Energy vs. delay for a 16-bit adder driving 512 fF of load, optimized with the slope-correction model using a 65-nm standard-cell library. (Axes: internal power in W vs. delay normalized to the synthesized adder; points A (synthesized adder) and B, and curves for simulation, LE with slope correction, synthesis timing, and the original LE model for the Matlab-optimized adders.)

Gates in the non-critical paths are minimized in the optimization process; however, the synthesis timing library builds in additional margin for these small gates. This leads to more conservative timing estimations by the synthesis library; actual delays are shorter by up to 20%. The original logical effort model has errors of up to 40% compared to simulation, and is clearly inaccurate in the scope of this optimization.

6.7 Improving the Optimization Tool

Based on the concepts from Chapter V, we can perform sizing optimization of synthesized designs in conjunction with V_DD optimization for a variety of performance requirements. But before modifying the optimization tool to support V_DD optimization, a few changes to the tool are necessary. The first issue is that the recursive timing analysis used in Section 6.5 is infeasible for large-scale designs, for the number of possible paths grows exponentially with the number of gates. To address this issue, an arrival-time-based [24] timing analysis is implemented, which keeps track of the critical-path delay to each characterized gate as the algorithm proceeds down the datapath, characterizing only gates that have all their inputs characterized. However, to make this algorithm work with the slope-correction model, the critical path to each gate alone is not sufficient, for the delay is also dependent on the input driver of the gate. To prevent such backward dependency, the critical path to the current gate includes not only the timing up to this point, but also the term s_i·g_(i-1)·h_(i-1) from (2.4), which is the slope-correction term for the following stage. Parameter s_i is equal to 1/K of the following gate, and g_(i-1)·h_(i-1) is the logical-effort parameter of the current gate.

Another issue arises with using the Matlab function fminsearch for sizing optimizations. Although the function handles discontinuity, as is the case with standard-cell designs, it is highly inefficient, is less likely to reach an optimal solution, and is not designed for high-dimensional optimization [23]. It would be beneficial to implement a gradient-based optimization that is still suitable for such high-dimensional problems; although the discontinuity of the energy and delay functions (and gradients) could make such an approach difficult, an approximation by a continuous function could be used during optimization. Figure 6-8 shows the flow diagram of the improved optimization tool.

    (1)    Import netlist & create gate connections
    (2)    Determine minimum achievable delay
    (3)    Timing constraint feasible? (Yes / No: change constraint to minimum-feasible delay)
    (4)    Initial critical path meets timing? (Yes / No)
    (4.1)  Resize path to meet delay
    (5)    Minimize sizing of critical paths within constraint
    (6)    Determine initial V_DD given critical-path delay slack
    (7)    Global optimization of sizing, V_DD, and V_T for minimum energy
    (8)    Fine-tune sizing of critical paths for minimum energy
    (9)    Critical path (in standard-cell sizes) meets timing? (Yes / No)
    (10.1) Fine-tune V_DD, V_T for minimum energy
    (10.2) Increase V_DD slightly to meet delay constraint

Figure 6-8: Flow diagram of the improved optimization tool.

Every time a delay estimation is required (in steps (2), (4), (6), (8), and (9) of Figure 6-8), the arrival-time-based timing check is performed. Starting from the list of inputs of the design, the tool first determines the gates that are directly driven by these inputs, and the timing characterization of these gates can be performed. The algorithm then determines the next set of gates whose input gates are already characterized, and these gates are characterized next. To avoid recursion in finding input/output gates and calculating gate delay, a gate-information table is maintained for each gate within the design (Figure 6-9).

Figure 6-9: Gate-information table for each gate.
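The level-by-level characterization described above can be sketched as a ready-queue traversal. This is a simplified Python illustration, not the thesis's Matlab implementation: each gate is given a fixed, hypothetical delay, so the slope-correction term that depends on the driving gate is deliberately left out, and the data layout and names are assumptions.

```python
from collections import deque

def arrival_times(gates, primary_inputs):
    """gates maps gate name -> (input nets, output net, gate delay).
    Returns the latest arrival time at every net, visiting each gate once,
    and only after all of its inputs have been characterized."""
    arrival = {net: 0.0 for net in primary_inputs}
    fanout = {}      # net -> gates that read it
    pending = {}     # gate -> number of inputs not yet characterized
    for name, (inputs, _out, _delay) in gates.items():
        pending[name] = sum(1 for net in inputs if net not in arrival)
        for net in inputs:
            fanout.setdefault(net, []).append(name)
    ready = deque(name for name, count in pending.items() if count == 0)
    while ready:
        name = ready.popleft()
        inputs, out, delay = gates[name]
        arrival[out] = max(arrival[net] for net in inputs) + delay
        for succ in fanout.get(out, []):
            pending[succ] -= 1
            if pending[succ] == 0:
                ready.append(succ)
    return arrival

# Hypothetical three-gate example with made-up delays.
example = {
    "g1": (["a", "b"], "n1", 2.0),
    "g2": (["n1", "c"], "n2", 1.5),
    "g3": (["n1", "n2"], "z", 1.0),
}
times = arrival_times(example, ["a", "b", "c"])
```

Each gate and each connection is touched a bounded number of times, so the cost grows with the size of the netlist rather than with the number of paths, which is the property that motivates replacing the recursive path enumeration.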

As shown in Figure 6-9, the tool keeps not only the general and input/output information for each gate, but also the critical-path delay to each of its inputs. Given this information structure, the tool can easily determine the driving and loading gates without using recursion, and the complexity is essentially O(n). Field fanout_index shows 5.1, meaning that the output of this gate is input 1 of gate number 5. Field crit_path stores the list of critical-path gates up to each of the inputs, so the crit_path of the following stage is the crit_path of the slowest input concatenated with the current gate. When the critical path and delay of a gate are determined, this information is written to all of its fan-out gates. Fields g, p, and K are the logical-effort and slope-correction parameters, assigned for each input. Field CapIn is the input capacitance, while cap_g and cap_p are the modeled input capacitances as a function of size. Modeling input capacitance with a linear model (instead of using discrete capacitances) makes sizing optimization easier.

For function continuity, the energy optimizations assume continuous gate sizes and map them to the nearest standard-cell gate size after the optimization terminates. This approach does not guarantee optimality, and the actual (standard-cell) delay may differ by up to 5%. However, having a continuous function allows the implementation of gradient-based methods [25]. Though Newton's method is commonly used for convex optimization due to its fast convergence to high accuracy, its complexity per iteration is roughly O(n^3) [26], whereas the complexity of first-order methods such as Nesterov's method is roughly O(n) per iteration. Such a method is effective for global optimizations where hundreds or thousands of gates are optimized simultaneously.
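The idea of optimizing over continuous sizes and then mapping to the nearest standard-cell size can be sketched with a small first-order loop. This Python example is an illustration under stated assumptions, not the tool itself: the cost function is a hypothetical convex stand-in (not the thesis's energy or delay model), the library sizes and coefficients are made up, the momentum weight is held constant, and sizes are kept inside their bounds by simple clipping.

```python
LIBRARY_SIZES = [1, 2, 4, 6, 8, 12, 16]   # hypothetical drive strengths
COEFFS = [4.0, 36.0, 400.0]               # hypothetical per-gate constants

def cost(w):
    """Stand-in convex cost: a size (energy-like) term plus a 1/size term."""
    return sum(wi + ci / wi for wi, ci in zip(w, COEFFS))

def clip(w, lo, hi):
    """Keep every size within its lower and upper bound."""
    return [min(max(wi, lo), hi) for wi in w]

def fd_gradient(f, y, eps=1e-6):
    """Forward-difference gradient: bump one variable at a time."""
    base = f(y)
    grad = []
    for i in range(len(y)):
        bumped = list(y)
        bumped[i] += eps
        grad.append((f(bumped) - base) / eps)
    return grad

def projected_nesterov(f, x0, lo, hi, step=0.1, beta=0.5, iters=300):
    """x_k = y_(k-1) - step * grad f(y_(k-1)), then y_k = x_k + beta * (x_k - x_(k-1)),
    with both points clipped to the box [lo, hi]."""
    x_prev = list(x0)
    y = list(x0)
    for _ in range(iters):
        g = fd_gradient(f, y)
        x = clip([yi - step * gi for yi, gi in zip(y, g)], lo, hi)
        y = clip([xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)], lo, hi)
        x_prev = x
    return x_prev

def snap_to_library(w, sizes):
    """Map each continuous size to the nearest available standard-cell size."""
    return [min(sizes, key=lambda s: abs(s - wi)) for wi in w]

continuous = projected_nesterov(cost, [4.0, 4.0, 4.0], lo=1.0, hi=16.0)
snapped = snap_to_library(continuous, LIBRARY_SIZES)
```

For this stand-in cost the unconstrained optimum of each gate is the square root of its coefficient, so the third gate is held at the upper bound by the clipping step. Each iteration costs one cost evaluation per variable for the gradient plus O(n) vector updates.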

When only the critical-path gates are sized, such as in (4.1), (5), and (8) of Figure 6-8, Nesterov's method may not be necessary, because each critical path is optimized separately, and each path generally contains only 10 to 20 gates. However, the simplicity of Nesterov's method allows very fast convergence to a near-optimal point, usually within 20 iterations. Additional iterations may be unnecessary, because the process variations discussed in the previous section can easily outweigh the small improvement from additional iterations, and the available standard-cell sizes are limited.

However, implementing a Nesterov gradient method that works reliably is nontrivial. Because the method is an unconstrained optimization, all constraints need to be implemented as penalty functions. Namely, there are upper and lower bounds on sizing, given the available sizes in the standard-cell library. In addition, there are bounds on fan-out: the maximum fan-out is generally around 10, and the fan-out of each stage should be equal to or larger than that of the previous stage. Finally, energy needs to be minimized while the delay constraint of every critical path is met. Since every gate needs sizing and fan-out constraints, this results in 4n penalty functions for a design of n gates, in addition to the penalties for delay constraints. If the penalties are implemented as the logarithmic barriers described in [26], where the function value reaches infinity as the barrier is reached, this can easily lead to numerical instabilities due to the large number of constraints.

In Nesterov's method, x^(k) is updated as y^(k-1) minus a step in the direction of the gradient, where y^(k-1) is the momentum term. The gradient vector is calculated in Matlab by changing each variable, individually, by a very small amount, and re-evaluating the cost function. Given this scenario, even if x^(k-1) meets all the constraints, there is no guarantee that y^(k-1) and x^(k) will meet these constraints after the gradient step is taken. In other words, perturbing x^(k-1) by a very small amount can effectively calculate the gradient because it does not violate the constraints, but it does not prevent x^(k) from taking a step so large that it violates a constraint and causes the log-barrier to return infinity.

To address this issue, it is first necessary to reduce the number of penalty functions needed. In this optimization, x^(k) has n+2 variables, where the first n variables are the sizes of the n gates in the design, and the last 2 variables are V_DD and V_T. Gate sizing, V_DD, and V_T all have upper and lower bounds; therefore, instead of updating x^(k) based on the gradient alone, each element of x^(k) is also constrained to lie within its upper and lower bounds. This is similar to the gradient projection method in [25], and this alone eliminates 2n penalty functions. The other constraints, however, have more complex interdependencies and cannot be simply eliminated.

The next step is to modify the penalty function. Although the log-barrier can guarantee that the constraints are not violated, it imposes an infinite penalty on even a small violation of a constraint, and the optimal value can never be at the boundary. However, for the fan-out constraints, it is not necessary that the maximum fan-out be exactly 10 or less, nor that the fan-out of the next stage never be even slightly smaller than that of the current stage. The same applies to delay: even a design that meets the delay constraint during optimization is not guaranteed to meet it after conversion to standard-cell sizes. When timing violation occurs

after the optimization, a small increase in V_DD (in (10.1) of Figure 6-8) can solve this problem. Given that a slightly violated constraint is tolerable, the log-barrier, -(1/t)log(-u), is changed to an exponential penalty, (1/t)exp(s·u), to reduce the harsh penalties and achieve more numerical stability. For example, if a large step violates a log-barrier constraint by pushing u to the positive side, the cost function becomes infinite, and the gradient is evaluated to be 0, because a small change toward the negative side still results in infinity. On the other hand, if an exponential penalty is implemented, a constraint violation results in a large but finite value (assuming the violation is not large enough to cause overflow), so the cost function is finite and the gradient is non-zero; it is actually very steep, so the next step will most likely rectify the violated constraint. An exponential penalty with t=100 and s=2 was tested and found effective in this optimization.

One issue that occurs with a steep gradient is excessively large changes in x^(k). Given the large number of constraints, a change in x^(k) can easily run into a barrier, which causes a large gradient in the other direction and may cause x^(k) to run into an opposing barrier. To restrain this behavior, each step of x^(k) is limited to change no more than 25% from the current sizing, and no more than 1-2% from the current V_DD and V_T (due to the high sensitivity of delay to them). Given such a strict limit on V_DD and V_T, one may wonder whether the global optimization step (7) in Figure 6-8 is able to reach an optimal value within a reasonable number of iterations. In practice, it is observed that the calculated V_DD from (6), based on the initial sizing from (5), is within 5% of the optimal V_DD value. As a result, global optimization usually reaches an optimal value within 25 iterations, and very

low-power designs with very low V_DD may need more iterations. Additional iterations provide very little improvement, which is lost once sizing is converted to standard-cell sizes. Global optimization terminates when the best solution has not improved in the past 20 iterations, at which point the current solution is considered optimal.

As shown in most low-power designs, the latter stages near the output tend to be larger than the earlier stages. Since many paths still have delay slack after global optimization, a post-global optimization is performed on the critical path of every output (individually) to achieve additional gains from sizing. In most cases, however, the delay slack cannot be exploited due to fan-out limits, but such small-scale optimization is very fast, and sometimes results in up to 10% additional energy savings.

After the fine-tuning of critical-path gates is complete, the optimized gates are converted to the nearest standard-cell sizes. This step inevitably changes the delay characteristics, so timing is re-evaluated. Where a timing violation occurs, V_DD is increased slightly to rectify it, usually by no more than 1%. Where timing slack is present, V_DD and V_T are re-optimized to attempt further energy reduction. At the minimum-energy point (MEP), further energy reduction is not possible no matter how much additional delay slack is present. This is the case with many ultra-low-power designs.

As an example, the energy-per-operation of a 1000ps adder after each optimization step is shown in Figure 6-10. The numbers (5), (6), (7), (8), and (10) refer to the optimization steps in Figure 6-8. The initial design has a critical-path delay of 476ps, much faster than the timing requirement of 1000ps, and critical-path sizing and initial-V_DD reduction are able to halve the energy by exploiting the delay slack. Given this sizing and V_DD, global optimization is able to halve the energy again, followed by fine-tuning of critical-path sizing and V_DD. The final design operates at V_DD of 0.85V and V_T of 0.34V, and has energy-per-operation of merely 20% of the initial design.

Figure 6-10: Energy-per-operation after each optimization step. (Energy-per-operation in fJ for the initial design, Crit. Path Sizing (5), Init. Vdd Red. (6), Global Opt. (7), Fine-tune Crit. Path (8), and Fine-tune V_DD (10).)

6.8 Incorporating V_DD Scaling in Optimization of Synthesized Designs

Given the improved optimization tool from the previous section, we can perform optimization on the synthesized adder across the entire range of the energy-delay space, from the minimum-delay point (MDP) down to the minimum-energy point (MEP). The user only needs to provide one synthesized netlist and one timing constraint indicating the desired clock period. If the clock period provided is too small, the tool automatically sets the clock period to the minimum achievable delay. If the clock period is too large,

however, the tool will optimize energy and delay until the minimum-energy point is reached. Since further delay increase beyond the MEP does not result in energy savings, any delay slower than that of the MEP is considered suboptimal. When the throughput requirement is slower than that of the MEP, an architectural transformation should be used to employ time-multiplexing to share the hardware. This approach not only reduces area, which reduces leakage, but effectively shortens the delay requirement as well. Since architectural optimization is separate from circuit-level optimization, it is discussed in the appendix.

The energy-delay optimization of the same 16-bit adder with an activity factor of 10% is shown in Figure 6-11, from MDP down to MEP. Similar to the custom designs, switching energy dominates total energy until very low V_DD values.

Figure 6-11: Energy-delay plot of the optimized adder. (Energy, normalized to MEP, versus delay, normalized to MDP; curves: E_total, E_sw, E_lk.)

Note that leakage energy fluctuates slightly during low-speed operation, though theoretically it should increase monotonically over that range. This is due to a near-equal tradeoff between V_DD and V_T at those points, which results in a near-equal tradeoff between switching and leakage energy, meaning more than one combination of switching and leakage energy can lead to virtually identical total energy. One can verify this visually by observing that leakage energy decreases over the delay range of 200 to 2000 due to increasing V_T (Figure 6-12). Since increasing V_T already slows down the circuit, V_DD is unable to scale as fast, and switching energy is nearly constant over that range as a result. A slightly lower V_T could allow more aggressive V_DD scaling to save switching energy, but that would inevitably increase leakage energy.

Figure 6-12: V_DD vs. delay of the adder optimization. (Voltage in V versus delay, normalized to MDP; curves: V_DD, V_T.)

From Figure 6-12, we see that V_DD remains near its maximum value for very high-performance designs, while V_T is lowered to increase circuit speed. As circuit operation slows down, V_DD is decreased to reduce switching energy while V_T is increased to reduce the growing effect of leakage. However, V_T eventually reaches its upper bound (0.5V in this case), and the point of minimum energy is reached. Note the similarity of these curves to those for custom designs (Figure 5-12), though the ones in Figure 6-12 are not as smooth due to the limited sizing choices in standard-cell designs. A fixed V_T would not show this behavior: leakage energy would only increase as V_DD is scaled aggressively, because V_T cannot be increased to compensate for the increasing effect of leakage. Eventually the minimum-energy point is reached because the increase in leakage energy from a slower circuit equals the decrease in switching energy from a lower V_DD.

With this optimization tool, the user can determine the optimal V_DD (and when possible, V_T) of their design. Knowing the delay at the optimal V_DD, the IC model can be used to determine the equivalent delay at nominal V_DD and V_T. This equivalent delay is the timing constraint for the nominal-V_DD timing library, which is used by design automation tools to perform synthesis, place-and-route, and timing closure.
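The building blocks of the optimizer update from Section 6.7 (finite-difference gradient, bound projection, exponential penalty, and step limiting) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the thesis's Matlab code: all function names are hypothetical, the momentum coefficient (k-1)/(k+2) is a common choice for Nesterov's method that the text does not specify, the fixed `step` size is a placeholder, and clipping y to the bounds is an added assumption. Only the penalty forms, t=100, s=2, and the relative step limits come from the text.

```python
import math


def log_barrier(u, t=100.0):
    """-(1/t)*log(-u): infinite once the constraint u < 0 is violated."""
    return -math.log(-u) / t if u < 0 else math.inf


def exp_penalty(u, t=100.0, s=2.0):
    """(1/t)*exp(s*u): large but finite for a moderate violation u > 0."""
    return math.exp(s * u) / t


def finite_difference_gradient(cost, x, eps=1e-6):
    """Perturb each variable individually and re-evaluate the cost."""
    base = cost(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += eps
        grad.append((cost(shifted) - base) / eps)
    return grad


def clip(value, low, high):
    return min(max(value, low), high)


def nesterov_step(x_prev, y_prev, grad_y, k, step, lower, upper, max_rel_change):
    """One update: gradient step from y, step limit, bound projection, momentum."""
    x_new = []
    for xp, yp, g, lo, hi, rel in zip(x_prev, y_prev, grad_y,
                                      lower, upper, max_rel_change):
        candidate = yp - step * g
        # Limit the change relative to the current value (e.g. 25% for sizes).
        limit = rel * abs(xp)
        candidate = clip(candidate, xp - limit, xp + limit)
        # Project onto the box bounds instead of using a penalty for them.
        x_new.append(clip(candidate, lo, hi))
    momentum = (k - 1) / (k + 2)  # assumed momentum schedule
    y_new = [clip(xn + momentum * (xn - xp), lo, hi)  # clipping y is an assumption
             for xn, xp, lo, hi in zip(x_new, x_prev, lower, upper)]
    return x_new, y_new


# A violated constraint (u > 0) is infinite under the barrier, finite under the penalty.
print(log_barrier(0.1), exp_penalty(0.1))
```

In a full loop, `grad_y` would come from `finite_difference_gradient` applied to the energy cost plus the remaining exponential penalties, evaluated at y.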

CHAPTER VII Conclusion

In this thesis, we first discussed the input slope effect, a common scenario among gates that are tapered for energy reduction. Although tapering is effective in reducing gate area and energy, it introduces slope mismatch between the input and output of the gate. Since the original logical effort model assumes equal slope at the input and output, it becomes inaccurate under tapered scenarios due to its pessimistic assumption of input slope. This assumption has caused the logical effort model to give suboptimal designs in energy-delay optimizations. To address this issue, the slope-correction model is introduced; it subtracts delay based on the difference between input and output fan-out, and is shown to provide accuracy to within 5% under tapering scenarios, while the original logical effort model may have errors of up to 40% [27].

Downsizing the gates through tapering is effective for energy reduction of high-performance designs, but sizing optimization quickly reaches diminishing returns after a delay increment of 30% or more, especially when large loads are present at the output. To further reduce energy, supply voltage scaling ought to be included in the optimization. Supply voltage reduction can exploit delay slack of 100 or more, and effectively reduces total energy, not just internal energy within the gates. To allow aggressive supply voltage scaling, the current and delay models ought to be accurate for all regions of transistor operation, down to sub-threshold. The IC/EKV model is introduced, and is demonstrated

on energy-delay optimizations down to the minimum-energy point, where V_DD is usually near or below threshold. The exact location of the minimum-energy point also depends on leakage energy and circuit activity factor, in addition to gate delay.

For synthesized designs aimed at mass production, worst-case timing analysis ought to be used, and it must be characterized for V_DD scaling. It is also essential that the circuit is operational at the minimum allowable voltage in all process corners, especially the slow-PMOS fast-NMOS corner. Once the delay and V_DD characterization is complete, the developed large-scale optimization tool is able to optimize the entire synthesized design for optimal sizing, V_DD, and (when applicable) V_T. Given the delay and optimal V_DD, the user can determine the equivalent delay at nominal V_DD, which is set as the timing constraint for synthesis tools.

This thesis has focused mainly on digital design and optimization at the circuit and logic levels. However, there are many important steps at the system and architectural levels that are also crucial for arriving at a good design. The appendix of this thesis highlights the Matlab/Simulink design flow, along with several useful tools such as FPGA hardware acceleration, architectural optimization, and wordlength optimization.

APPENDIX System-Level Optimizations for Low Power Designs

A.1 Simulink Design Environment

To achieve energy-efficient designs, applying only circuit-level optimization is often insufficient. The architecture of the design also needs to be optimal given the design constraints and applications. However, traditional hardware design using a hardware description language such as Verilog or VHDL is often hand-coded, so any large change at the architectural level generally requires extensive coding, followed by detailed verification. Such large overhead often makes architectural optimization very tedious and inefficient. To speed up the design process, especially when frequent changes are required, the Matlab/Simulink design environment is recommended. The Synopsys/Synplify DSP blockset for Simulink is shown in Figure A-1 as an example. Designs are represented graphically, with connections shown as arrows and the wordlength of each block also shown. For details and features of the Synplify DSP blockset, please refer to [28].

Figure A-1: Snapshot of Synplify DSP blocks in Simulink design environment.

A.2 Automated FPGA Hardware-Acceleration

Before optimization is performed, the primary concern is generally to fully verify the design. To most designers, verification is the most tedious and time-consuming step of the design process. Unfortunately, software simulations are extremely slow, and even 1 second of real-time processing can easily take days to simulate in software. To address this issue, Xilinx has created the Xilinx System Generator (XSG) blockset, which is similar to the Synplify DSP blocks, except that it targets FPGA applications only. The XSG blockset creates a simple interface between the Matlab/Simulink environment and the FPGA, where input data is sent to the FPGA from Matlab, and output data from the FPGA is gathered and returned to the Matlab workspace. However, such simulation depends on synchronizing the FPGA clock with the slow internal clock of Matlab, which makes it almost as slow as software simulation. To achieve actual speed-up in simulation, a shared FIFO interface needs to be established at the input/output boundary of the design.

Figure A-2: A 32-tap FIR design with shared-FIFO interface and testbench.

Figure A-2 shows a 32-tap FIR design with a shared-FIFO interface, along with the testbench environment. Note that there is no physical connection between the point-to-point Ethernet block and the testbench; input and output data are only sent to buffers, shared memories, and shared FIFOs. This allows the FPGA to operate on its own system clock, independent of the internal clock of Matlab. With this approach, the Matlab interface only sends data to and retrieves data from the FIFOs, and the FPGA can operate at its own pace. Since there is no guarantee that data always exists in the FIFO, a write-enable needs to be added to all registers within the design, so that the registers are not updated unless the next valid data is ready.

Such hardware acceleration in XSG works quite well; however, XSG is not compatible with the ASIC design flow, as it is designated for FPGAs only. Since there is no automated conversion between XSG and Synplify DSP, users would be required to recreate their Synplify DSP designs in XSG to perform FPGA emulation. Fortunately, Synplify DSP can generate Verilog/VHDL code of its design as a black-box, which is compatible with the XSG emulation flow.

Other issues also arise in creating Shared FIFOs: because every input/output port needs a read/write FIFO, a large amount of wiring is needed. In addition, FIFOs can only be 16 or 32 bits wide (unsigned), so port concatenation is necessary. The concatenated port needs to be de-muxed on the FPGA side, and then assigned the correct wordlength information (signed/unsigned, binary point, etc.) before being sent to the input. The outputs also need to be concatenated on the FPGA, and then de-muxed in the Matlab testbench. Finally, the write-enable ports need to be added to the design and the output

ports of the Shared FIFO. This process is tedious and error-prone. The user would need to verify the testbench before verifying the design.

To address the large overhead of emulating a design on the FPGA, an automated FPGA hardware-acceleration tool is created for Synplify DSP designs. The user first needs to create their Synplify DSP design as a black-box with Verilog/VHDL code, which can be done easily at the click of a button.

Figure A-3: A Synplify DSP design created as a black-box.

With the black-box created with Synplify DSP, the automated FPGA tool will automatically create the required number of Shared FIFOs, concatenate and de-mux the data when necessary, assign the correct wordlength information, and connect all the required ports. Figure A-4 shows the final result of the Shared-FIFO conversion for the design in Figure A-3; note that the entire process requires no input from the user. The design in Figure A-4 is ready to be synthesized, which can be done by opening its System Generator block (top-left corner) and pushing Generate.
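The port concatenation and de-muxing that the tool automates can be illustrated with a small Python sketch. It packs several fixed-point ports into one unsigned 32-bit FIFO word and restores the wordlength information (width, binary point, signedness) on the other side. The function names, the `(width, frac_bits, signed)` format tuples, and the least-significant-first packing order are assumptions for illustration; they are not taken from XSG or Synplify DSP.

```python
def to_raw(value, width, frac_bits, signed):
    """Quantize a real value to a `width`-bit pattern (two's complement if signed)."""
    raw = round(value * (1 << frac_bits))
    low = -(1 << (width - 1)) if signed else 0
    high = (1 << (width - 1)) - 1 if signed else (1 << width) - 1
    if not low <= raw <= high:
        raise OverflowError(f"{value} does not fit in {width} bits")
    return raw & ((1 << width) - 1)


def from_raw(bits, width, frac_bits, signed):
    """Reapply sign and binary point to an unsigned bit pattern."""
    if signed and bits >> (width - 1):
        bits -= 1 << width
    return bits / (1 << frac_bits)


def pack_ports(values, formats, fifo_width=32):
    """Concatenate ports into one unsigned FIFO word, first port in the low bits."""
    if sum(width for width, _, _ in formats) > fifo_width:
        raise ValueError("ports do not fit in one FIFO word")
    word, offset = 0, 0
    for value, (width, frac_bits, signed) in zip(values, formats):
        word |= to_raw(value, width, frac_bits, signed) << offset
        offset += width
    return word


def unpack_ports(word, formats):
    """De-mux a FIFO word back into individual port values."""
    values, offset = [], 0
    for width, frac_bits, signed in formats:
        bits = (word >> offset) & ((1 << width) - 1)
        values.append(from_raw(bits, width, frac_bits, signed))
        offset += width
    return values


# Three hypothetical ports: signed 12.8, unsigned 10.0, signed 8.4 (30 bits total).
port_formats = [(12, 8, True), (10, 0, False), (8, 4, True)]
fifo_word = pack_ports([-1.5, 513, 3.25], port_formats)
print(unpack_ports(fifo_word, port_formats))
```

Keeping the format list in one place, shared by the packing and unpacking sides, is what removes the hand-wiring mistakes the text warns about.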

Figure A-4: A Synplify DSP design created as a black-box.


With the design ready for the FPGA, the testbench needs to be created. The tool was therefore developed to automate the testbench-creation process as well. Based on the Synplify DSP design, the tool automatically concatenates the inputs to be sent to buffers, which are then sent to the Shared FIFO block in the Matlab testbench. The outputs are connected automatically in the reverse fashion. The generated testbench is shown in Figure A-5 for the design in Figure A-4. Note that the grey box in the top-left corner is the instantiation of the FPGA-synthesized design from Figure A-4.

Figure A-5: Testbench for the Synplify DSP design.

The testbench and design in Figure A-5 are ready to be emulated on the FPGA. Compared to the original design, simulation time is reduced from 4 minutes to 20 seconds. However, the throughput is still I/O limited, so designs with a large hardware count but few I/Os (e.g. a 200-tap FIR filter) would benefit more from this approach.


A.3 Architectural Optimization

As we observed in this thesis, the regions of the energy-delay space near the minimum-delay point and the minimum-energy point are both very inefficient, and it is preferable to operate near the knee of the energy-delay curve. Architectural optimization helps achieve this goal by effectively relaxing the timing requirement of high-performance designs through parallelism and pipelining, or tightening the timing requirement of low-performance designs through time-multiplexing [9]. However, as introduced in Section A.1, creating such high-level architectural changes in Verilog/VHDL requires tedious recoding and verification. To address this problem, an automated architectural optimization tool was created by Rashmi Nanda in [29], which automatically determines and creates the possible architectures, given a Simulink design and its performance/energy requirements.

Figure A-6: Possible transformations and valid architectures given the constraints.
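As a first-order illustration of how these transformations shift the timing requirement seen by each hardware unit, the Python sketch below computes the per-operation delay budget. It is an assumption-laden simplification, not the tool of [29]: the function name and parameters are hypothetical, multiplexing and register overheads are ignored, and pipelining is left out because its benefit depends on where the logic is cut.

```python
def per_unit_delay_budget(sample_period, parallel_units=1, time_mux_factor=1):
    """Delay budget for one operation on one hardware unit.

    With P parallel units, each unit only has to finish one operation every
    P sample periods. With time-multiplexing, one unit performs M operations
    per sample period, so each operation gets 1/M of it.
    """
    if parallel_units < 1 or time_mux_factor < 1:
        raise ValueError("factors must be at least 1")
    return sample_period * parallel_units / time_mux_factor


# A 1000 ps requirement relaxed by 4-way parallelism, or tightened by 8-way sharing.
print(per_unit_delay_budget(1000, parallel_units=4))
print(per_unit_delay_budget(1000, time_mux_factor=8))
```

Parallelism moves a design at the minimum-delay end toward the knee by giving each unit more time, while time-multiplexing moves a design beyond the minimum-energy point back toward it by giving each unit less.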
