CSE241 VLSI Digital Circuits, Winter 2003
Lecture 07: Timing II

Delay Calculation

Cell Fall (ns)
Cap (pF) \ Tr (ns)    0.05    0.2     0.5
0.01                  0.02    0.16    0.30
0.5                   0.04    0.32    0.60
2.0                   0.08    0.64    1.20

Cell Rise (ns)
Cap (pF) \ Tr (ns)    0.05    0.2     0.5
0.01                  0.03    0.18    0.33
0.5                   0.06    0.36    0.66
2.0                   0.09    0.72    1.32

Fall Transition (ns)
Cap (pF) \ Tr (ns)    0.05    0.2     0.5
0.01                  0.01    0.09    0.15
0.5                   0.03    0.27    0.45
2.0                   0.06    0.54    0.90

[Figure: cell with waveform annotations 0.1 ns, 0.12 ns, 0.147 ns and output load 1.0 pF]
- Fall delay = 0.178 ns
- Rise delay = 0.261 ns
- Fall transition = 0.147 ns
- Rise transition = (not given)
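
The table lookups above can be checked with bilinear interpolation. This is a minimal sketch under stated assumptions: the figure's 0.1 ns annotation is taken as the input transition for the falling output, 0.12 ns as the input transition for the rising output, and 1.0 pF as the load. The function names `lookup` and `_segment` are hypothetical. Under these assumptions the interpolated values agree with the three results on the slide.

```python
from bisect import bisect_right

CAP = [0.01, 0.5, 2.0]   # load capacitance axis (pF), table rows
TR = [0.05, 0.2, 0.5]    # input transition axis (ns), table columns

CELL_FALL = [[0.02, 0.16, 0.30], [0.04, 0.32, 0.60], [0.08, 0.64, 1.20]]
CELL_RISE = [[0.03, 0.18, 0.33], [0.06, 0.36, 0.66], [0.09, 0.72, 1.32]]
FALL_TRANSITION = [[0.01, 0.09, 0.15], [0.03, 0.27, 0.45], [0.06, 0.54, 0.90]]


def _segment(axis, x):
    """Index i of the axis segment [axis[i], axis[i+1]] used for x.

    Points outside the axis range reuse the nearest end segment,
    so they are extrapolated linearly.
    """
    i = bisect_right(axis, x) - 1
    return min(max(i, 0), len(axis) - 2)


def lookup(table, cap, tr):
    """Bilinearly interpolate table[cap_index][tr_index] at (cap, tr)."""
    i, j = _segment(CAP, cap), _segment(TR, tr)
    u = (cap - CAP[i]) / (CAP[i + 1] - CAP[i])
    t = (tr - TR[j]) / (TR[j + 1] - TR[j])
    low = table[i][j] + t * (table[i][j + 1] - table[i][j])
    high = table[i + 1][j] + t * (table[i + 1][j + 1] - table[i + 1][j])
    return low + u * (high - low)


# Assumed operating points: see the lead-in paragraph.
fall_delay = lookup(CELL_FALL, cap=1.0, tr=0.1)
rise_delay = lookup(CELL_RISE, cap=1.0, tr=0.12)
fall_transition = lookup(FALL_TRANSITION, cap=1.0, tr=0.1)
print(f"{fall_delay:.3f} {rise_delay:.3f} {fall_transition:.3f}")
# prints: 0.178 0.261 0.147
```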

PVT (Process, Voltage, Temperature) Derating
- Actual cell delay = Original delay × K_PVT

PVT Derating: Example + Min/Typ/Max Triples
- Proc_var (0.5 : 1.0 : 1.3)
- Voltage (5.5 : 5.0 : 4.5)
- Temperature (0 : 20 : 50)
- K_P = 0.80 : 1.00 : 1.30
- K_V = 0.93 : 1.00 : 1.08
- K_T = 0.80 : 1.07 : 1.35
- K_PVT = 0.60 : 1.07 : 1.90
- Cell delay = 0.261 ns
- Derated delay = 0.157 : 0.279 : 0.496 ns {min : typical : max}
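
The numbers above are consistent with K_PVT being the product K_P × K_V × K_T, rounded to two decimals before it is applied to the cell delay. The slide does not state this rule explicitly, so treat it as an assumption. The helper name `derate` is hypothetical.

```python
def derate(delay_ns, k_p, k_v, k_t):
    """Return (K_PVT, derated delay in ns).

    Assumption: K_PVT is the product of the three factors, rounded to
    two decimals before use, which reproduces the slide's values.
    """
    k_pvt = round(k_p * k_v * k_t, 2)
    return k_pvt, round(delay_ns * k_pvt, 3)


# (K_P, K_V, K_T) per corner, taken from the slide.
corners = {
    "min": (0.80, 0.93, 0.80),
    "typ": (1.00, 1.00, 1.07),
    "max": (1.30, 1.08, 1.35),
}
results = {name: derate(0.261, *k) for name, k in corners.items()}
for name, (k_pvt, delay) in results.items():
    print(f"{name}: K_PVT = {k_pvt:.2f}, derated delay = {delay:.3f} ns")
```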

Conservatism of Gate Delay Modeling
- True gate delay depends on input arrival time patterns
- STA will assume that only 1 input is switching
- Will use worst slope among several inputs
[Figure: gate driving load C_L; Vdd-versus-time waveforms of inputs A, B and output F showing propagation delay t_pd]

This Class + Logistics
- Reading
  - Smith, Chapters 15, 16
  - http://vlsicad.ucsd.edu/presentations/iccad00tutorial/
  - Possibly: Sarrafzadeh/Wong, Chapters 2 (placement), 3 (routing), (4, performance modeling)
- Schedule
  - MT will be take-home (and easy), BUT you lose 5% if you don't show up on Thursday (attendance will be taken by Ben)
  - Thursday: surprise guest lecturer on floorplan / placement
- HW #12: Suppose that you want to work on timing edges that are most critical according to some F(slack of the edge, #paths through the edge). How would you modify the STA calculation (longest path in a DAG) so that it also calculates the number of paths through each edge?
Slide courtesy of S. P. Levitan, U. Pittsburgh
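
One possible way to approach HW #12, offered as a sketch and not as the intended solution: alongside the longest-path forward pass, count source-to-node paths in topological order, count node-to-sink paths in reverse order, and multiply the two counts across each edge. The function name `analyze` and the edge-dictionary format are hypothetical.

```python
from collections import defaultdict, deque


def analyze(edges):
    """edges: dict mapping (u, v) -> delay.

    Returns (arrival, paths_through): the longest arrival time per node
    and the number of source-to-sink paths through each edge.
    """
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for (u, v) in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Topological order (Kahn's algorithm).
    indeg = {n: len(pred[n]) for n in nodes}
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)

    # Forward pass: longest arrival time and number of source-to-node paths.
    arrival = {n: 0.0 for n in nodes}
    n_in = {n: (1 if not pred[n] else 0) for n in nodes}
    for n in order:
        for m in succ[n]:
            arrival[m] = max(arrival[m], arrival[n] + edges[(n, m)])
            n_in[m] += n_in[n]

    # Backward pass: number of node-to-sink paths.
    n_out = {n: (1 if not succ[n] else 0) for n in nodes}
    for n in reversed(order):
        for m in succ[n]:
            n_out[n] += n_out[m]

    paths_through = {(u, v): n_in[u] * n_out[v] for (u, v) in edges}
    return arrival, paths_through


# Small hypothetical timing graph: two sources (a, b), two sinks (d, e).
example = {("a", "c"): 1.0, ("b", "c"): 2.0, ("c", "d"): 3.0, ("c", "e"): 1.0}
arrival, paths_through = analyze(example)
print(arrival["d"], paths_through[("c", "d")])
```

Both passes visit each edge once, so the path counts add only a constant factor to the usual longest-path computation.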

Buffer Clustering
- Hierarchical clustering starting from clock sinks = leaves of clustering tree
- Fanout at each level between 5 and 200 (depends on buffer library)
- Often specify a clock topology in the tool as, e.g., (1)-6-8-5
  - Root has 6 children, each of which has 8 children, each of which has 5 (leaf) children: 240 clock sinks
- Big question: how to perform the hierarchical buffer clustering? What makes a good cluster?
Sylvester / Shepard, 2001
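
The sink count for a topology string such as (1)-6-8-5 is the running product of the per-level fanouts. A small sketch, with the hypothetical helper name `level_sizes`:

```python
def level_sizes(fanouts):
    """Node count per level, root first, for a fanout spec such as [6, 8, 5]."""
    sizes = [1]
    for f in fanouts:
        sizes.append(sizes[-1] * f)
    return sizes


sizes = level_sizes([6, 8, 5])
print(sizes)            # [1, 6, 48, 240]: the last entry is the number of clock sinks
print(sum(sizes[:-1]))  # 55: internal nodes (buffers), counting the root driver
```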

Buffer Clustering by Space Partitioning
- Example: Cadence CT-Gen
- Pick fanout (e.g., 6-4)
- Pick long axis of bounding box of sinks
- Place buffers at medians (essentially) of chunks of sinks identified by space partitioning
- Why is this good? E.g., conservative (in what sense?), easy to predict
- Why is it bad? E.g., wastes a lot of resources
Sylvester / Shepard, 2001
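
The steps above can be sketched as a recursive partition. This is an illustration only and does not claim to match CT-Gen's actual algorithm. It assumes a non-empty sink list, chunks of near-equal sink count, and the coordinate-wise median as the buffer location. The name `partition_cluster` and the nested-dict result format are hypothetical.

```python
def partition_cluster(sinks, fanouts):
    """Recursively split sinks along the long axis of their bounding box.

    sinks: non-empty list of (x, y) points.
    fanouts: per-level fanout, top level first, e.g. [6, 4].
    Returns {"buffer": (x, y), "children": [...]}; the children of a
    last-level buffer are the sink points themselves.
    """
    xs = sorted(p[0] for p in sinks)
    ys = sorted(p[1] for p in sinks)
    # Buffer location: coordinate-wise (upper) median of this chunk's sinks.
    node = {"buffer": (xs[len(xs) // 2], ys[len(ys) // 2]), "children": []}
    if len(fanouts) <= 1:
        node["children"] = list(sinks)  # last level drives the sinks directly
        return node
    # Long axis of the bounding box: 0 for x, 1 for y.
    axis = 0 if xs[-1] - xs[0] >= ys[-1] - ys[0] else 1
    ordered = sorted(sinks, key=lambda p: p[axis])
    n, k = len(ordered), min(fanouts[0], len(ordered))
    # Cut the ordered sinks into k non-empty chunks of near-equal size.
    bounds = [i * n // k for i in range(k + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        node["children"].append(partition_cluster(ordered[lo:hi], fanouts[1:]))
    return node


# 24 sinks on a 6 x 4 grid, clustered with a 6-4 topology.
grid = [(x, y) for x in range(6) for y in range(4)]
tree = partition_cluster(grid, [6, 4])
print(tree["buffer"], [child["buffer"] for child in tree["children"]])
```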

Buffer Clustering by Traditional Clustering
- Example: SPC, old Cell3 CTS
- Pick fanout (e.g., 6)
- Find clusters of size 6
- Place buffers at centers or centroids of clusters
- Recurse
- Why is this good? E.g., uses less wire
- Why is this bad? E.g., hard to predict the results, very brittle under ECOs
- HW #13: Propose a hierarchical clustering strategy for buffered clock trees, and explain its pros and cons
Sylvester / Shepard, 2001

Outline
- Clocking
- Storage elements
- Clocking metrics and methodology
- Clock distribution
- Package and useful-skew degrees of freedom
- Clock power issues
- Gate timing models

Skew Reduction Using Package
- Most clock network latency occurs at global level (largest distances spanned)
- Latency Skew
- With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming
Sylvester / Shepard, 2001

Skew Reduction Using Package
[Figure labels: µP/ASIC, system clock, solder bump, substrate]
- Incorporate global clock distribution into the package
- Flip-chip packaging allows for high-density, low-parasitic access from substrate to IC
  - RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring
  - Global skew reduced
  - Lower capacitance, lower power
  - Opens up global routing tracks
- Results not yet conclusive
Sylvester / Shepard, 2001

Useful Skew (= cycle-stealing)
[Figure: zero skew vs. useful skew on an FF, fast path, FF, slow path, FF chain, with hold/setup timing slacks for each case]
- Zero skew
  - Global skew constraint
  - All skew is bad
- Useful skew
  - Local skew constraints
  - Shift slack to critical paths
W. Dai, UC Santa Cruz

Skew = Local Constraint
- Timing is correct as long as the signal arrives in the permissible skew range
- D: longest path; d: shortest path (between the two FFs)
- -d + t_hold < Skew < T_period - D - t_setup
[Figure: skew axis divided into race condition | safe (permissible range) | cycle time violation]
W. Dai, UC Santa Cruz
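
The inequality above translates directly into a range check. A minimal sketch: the function names and the numeric values in the usage lines are hypothetical, and skew values exactly on a bound are treated as violations because the slide's inequalities are strict.

```python
def permissible_skew_range(t_period, d_long, d_short, t_setup, t_hold):
    """Return (lower, upper) bounds of the open permissible skew interval.

    Implements -d + t_hold < skew < T_period - D - t_setup, where D is the
    longest and d the shortest path delay between the two flip-flops.
    """
    return (-d_short + t_hold, t_period - d_long - t_setup)


def classify(skew, lower, upper):
    """Map a skew value to the three regions named on the slide."""
    if skew <= lower:
        return "race condition"
    if skew >= upper:
        return "cycle time violation"
    return "safe"


# Hypothetical numbers (ns): T = 6, D = 4, d = 1, t_setup = 0.2, t_hold = 0.1.
lower, upper = permissible_skew_range(6.0, 4.0, 1.0, 0.2, 0.1)
for skew in (-1.5, 0.0, 2.5):
    print(skew, classify(skew, lower, upper))
```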

Skew Scheduling for Design Robustness
- Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge
- Can solve a linear program to maximize robustness = determine prescribed sink skews
[Figure: three FFs in a chain with path delays 2 ns and 6 ns, T = 6 ns. Schedule 0 0 0: at verge of violation. Schedule 2 0 2: more safety margin.]
W. Dai, UC Santa Cruz
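
The two schedules in the example can be compared by computing how far each path's skew sits from the edge of its permissible range. This sketch evaluates candidate schedules and does not solve the linear program. It assumes the triples are clock arrival times at the three FFs, skew is launch arrival minus capture arrival (the sign that matches -d + t_hold < Skew < T_period - D - t_setup), shortest and longest path delays are equal, and setup and hold times are zero. The name `min_margin` is hypothetical.

```python
def min_margin(clock_arrival, paths, t_period, t_setup=0.0, t_hold=0.0):
    """Smallest distance from any path's skew to an edge of its permissible range.

    clock_arrival: dict mapping FF name -> clock arrival time (ns).
    paths: list of (launch_ff, capture_ff, D_longest, d_shortest).
    """
    margins = []
    for src, dst, d_long, d_short in paths:
        skew = clock_arrival[src] - clock_arrival[dst]
        lower = -d_short + t_hold
        upper = t_period - d_long - t_setup
        margins.append(min(skew - lower, upper - skew))
    return min(margins)


# Slide example: path delays 2 ns and 6 ns, T = 6 ns (d = D assumed).
paths = [("FF1", "FF2", 2.0, 2.0), ("FF2", "FF3", 6.0, 6.0)]
zero_skew = min_margin({"FF1": 0.0, "FF2": 0.0, "FF3": 0.0}, paths, t_period=6.0)
useful_skew = min_margin({"FF1": 2.0, "FF2": 0.0, "FF3": 2.0}, paths, t_period=6.0)
print(zero_skew, useful_skew)  # 0.0 2.0
```

Under these assumptions the zero-skew schedule has no margin on the 6 ns path, while the 2 0 2 schedule leaves 2 ns on both paths. A linear program would search for the arrival times that maximize this minimum margin.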

Potential Advantages of Useful Skew
- Reduce peak current consumption by distributing the FF switch point in the range of permissible skew
[Figure: CLK, 0-skew vs. U-skew]
- Affords extra margin to increase clock frequency or reduce sizing (= power)
W. Dai, UC Santa Cruz

Conventional Zero-Skew Flow
- Synthesis
- Placement
- 0-Skew Clock Synthesis
- Clock Routing
- Signal Routing
- Extraction & Delay Calculation
- Static Timing Analysis
W. Dai, UC Santa Cruz

Useful-Skew Flow
- Existing Placement
- U-Skew Clock Synthesis
- Permissible range generation
- Initial skew scheduling
- Clock tree topology synthesis
- Clock net routing
- Clock Routing
- Clock timing verification
- Signal Routing
- Extraction & Delay Calculation
- Static Timing Analysis
W. Dai, UC Santa Cruz

Outline
- Clocking
- Storage elements
- Clocking metrics and methodology
- Clock distribution
- Package and useful-skew degrees of freedom
- Clock power issues
- Gate timing models

Clock Power
- Power consumption in clocks due to:
  - Clock drivers
  - Long interconnections
  - Large clock loads: all clocked elements (latches, FFs) are driven
- Different components dominate, depending on type of clock network used
  - Ex. Grid: huge pre-drivers & wire cap. drown out load cap.
Sylvester / Shepard, 2001

Clock Power Is LARGE
- P = α C V_dd^2 f
- Not only is the clock capacitance large, it switches every cycle!
Sylvester / Shepard, 2001
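
To make the formula concrete, the sketch below compares a clock net with a logic net of the same capacitance. All numeric values are hypothetical and chosen only for illustration. The clock is given α = 1 because it switches every cycle, and the logic net is given an assumed activity of 0.1.

```python
def dynamic_power(alpha, cap_farads, vdd_volts, freq_hz):
    """P = alpha * C * Vdd^2 * f, in watts."""
    return alpha * cap_farads * vdd_volts ** 2 * freq_hz


# Hypothetical values: 1 nF of switched capacitance, 1.5 V supply, 1 GHz clock.
clock_w = dynamic_power(alpha=1.0, cap_farads=1e-9, vdd_volts=1.5, freq_hz=1e9)
logic_w = dynamic_power(alpha=0.1, cap_farads=1e-9, vdd_volts=1.5, freq_hz=1e9)
print(f"clock: {clock_w:.3f} W, logic at alpha = 0.1: {logic_w:.3f} W")
```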

Low-Power Clocking
- Gated clocks
  - Prevent switching in areas of chip not being used
  - Easier in static designs
- Edge-triggered flops in ARM rather than transparent latches in Alpha
  - Reduced load on clock for each latch/flop
  - Eliminated spurious power-consuming transitions during latch flow-through (transparency)
Sylvester / Shepard, 2001

Clock Area
- Clock networks consume silicon area (clock drivers, PLL, etc.) and routing area
- Routing area is most vital
- Top-level metals are used to reduce RC delays
  - These levels are precious resources (unscaled)
  - Power routing, clock routing, key global signals
- Reducing area also reduces wiring capacitance and power
- Typical #s: Intel Itanium, 4% of M4/5 used in clock routing
Sylvester / Shepard, 2001

Clock Slew Rates
- To maintain signal integrity and latch performance, minimum slew rates are required
- Too slow: clock is more susceptible to noise, latches are slowed down, setup times eat into timing budget [T_setup = 200 + 0.33 * T_slew (ps)], more short-circuit power for large clock drivers
- Too fast: burns too much power, overdesigned network, enhanced ground bounce
- Rule of thumb: T_rise and T_fall of clock are each between 10% and 20% of clock period (10% is an aggressive target)
  - 1 GHz clock: T_rise = T_fall = 100-200 ps
Sylvester / Shepard, 2001
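
The rule of thumb and the setup-time relation above can be evaluated together. A short sketch with hypothetical function names; the setup formula is the one quoted on the slide and is not a general model.

```python
def slew_window_ps(freq_hz, low=0.10, high=0.20):
    """Target rise/fall time window (ps): 10% to 20% of the clock period."""
    period_ps = 1e12 / freq_hz
    return low * period_ps, high * period_ps


def setup_time_ps(t_slew_ps):
    """Setup time from the slide: T_setup = 200 + 0.33 * T_slew (ps)."""
    return 200 + 0.33 * t_slew_ps


fast, slow = slew_window_ps(1e9)  # 1 GHz clock, 1000 ps period
print(fast, slow)                 # window of about 100 ps to 200 ps
print(setup_time_ps(fast), setup_time_ps(slow))
```

At 1 GHz, moving from the aggressive 10% target to the 20% limit costs about 33 ps of additional setup time under this relation.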

Example: Alpha 21264
- Grid + H-tree approach
- Power = 32% of total
- Wire usage = 3% of metals 3 & 4
- 4 major clock quadrants, each with a large driver connected to local grid structures
Sylvester / Shepard, 2001

Alpha 21264 Skew Map
Ref: Compaq, ASP-DAC00
Sylvester / Shepard, 2001

Power vs. Skew
- Fundamental design decision
- Meeting skew requirements is easy with unlimited power budget
  - Wide wires reduce RC product but increase total C
  - Driver upsizing reduces latency (reduces skew as well) but increases buffer cap
- SOC context: plastic package power limit is 2-3 W
Sylvester / Shepard, 2001

Clock Distribution Trends
- Timing
  - Clock period dropping fast, skew must follow
  - Slew rates must also scale with cycle time
- Jitter
  - PLLs get better with CMOS scaling but other sources of noise increase
    - Power supply noise more important
    - Switching-dependent temperature gradients
- Materials
  - Cu reduces RC slew degradation, potential skew
  - Low-k decreases power, improves latency, skew, slews
- Power
  - Complexity, dynamic logic, pipelining: more clock sinks
  - Larger chips: bigger clock networks
Sylvester / Shepard, 2001