EE M216A.:. Fall 2010 Lecture 12 Clocking Issues: istribution, Energy Prof. ejan Marković ee216a@gmail.com Clock istribution Goals: eliver clock to all memory elements with acceptable skew eliver clock edges with acceptable sharpness Clock network design is another one of the big challenges in the design of a large chip Clocks are generally distributed via wiring trees Want to use low resistance interconnect to minimize delay Use multiple drivers to distribute driver requirements Use optimal sizing principles to design buffers Clock lines can create significant crosstalk Lecture 12:. Clocking Markovic Issues / Slide 2 2
Issues in Clock istribution Network Skew Process, voltage, temp. ata dependence Noise coupling Load balancing Power, CV 2 f (no ½ or α) Clock gating Flexibility/Tunability Compactness fit into existing layout/design Reliability Electromigration Skew from distribution See videos from P. Restle (IBM) on classwiki Lecture 12:. Clocking Markovic Issues / Slide 3 3 Clock istribution Methods RC Tree Less capacitance More accuracy Fl ibl i i Grids Reliable Less data dependency T bl (l t i d i ) Flexible wiring Tunable (late in design) Shown here for final stage drivers driving F F loads Lecture 12:. Clocking Markovic Issues / Slide 4 4
RC Trees, Clock istribution Observe: only relative skew is important H Tree X Tree Binary Tree Asymmetric trees that match RCs can be used CLOCK H-Tree Network Lecture 12:. Clocking Markovic Issues / Slide 5 5 More Realistic H Tree [Restle98] Lecture 12:. Clocking Markovic Issues / Slide 6 6
The Grid System No RC matching Large power river GCLK clock grid EC Alpha Examples GCLK river river GCLK 21064 21164 river GCLK 21264 Lecture 12:. Clocking Markovic Issues / Slide 7 7 Examples of istribution H Tree Asymmetric RC Tree IBM Grids EC [Alphas] Serpentines Intel x86 [Young ISSCC97] Lecture 12:. Clocking Markovic Issues / Slide 8 8
Examples of Processor Chips EC Alpha 21064 clock spines Note: reverse Z axis EC Alpha 21064 RC delays EC Alpha 21164 RC delays for global distribution Spine + Grid EC Alpha 21164 RC local delays Lecture 12:. Clocking Markovic Issues / Slide 9 9 Example: EV5 (Alpha 21164) Clocking (1995) t cycle = 3.3ns t rise = 0.35ns t skew = 150ps Clock rivers Single phase clocking 2 distributed driver channels Reduced RC delay/skew Improved thermal distribution 3.75 nf clock load 58 cm final driver width Local inverters for latching Conditional clocks in caches to reduce power More complex race checking evice variation Lecture 12:. Clocking Markovic Issues / Slide 10 10
Example: EV6 (Alpha 21264) Clocking (1998) 600MHz, 0.35µm CMOS t cycle = 1.67ns t rise = 0.15ns Global clock waveform t skew = 50ps rise skew p PLL Multiple conditional buffered clocks 2.8 nf clock load 40 cm final driver width Reduced load/skew Reduced thermal issues Multiple clocks complicate race checking Lecture 12:. Clocking Markovic Issues / Slide 11 11 21264 Clocking Lecture 12:. Clocking Markovic Issues / Slide 12 12
EV6 Clock Results ps 5 10 15 20 25 30 35 40 45 50 ps 300 305 310 315 320 325 330 335 340 345 GCLK Skew (at V /2 Crossings) GCLK Rise Times (20% to 80% Extrapolated to 0% to 100%) Lecture 12:. Clocking Markovic Issues / Slide 13 13 EV7 Clock Hierarchy, 2002 152 million transistors, 15/137 logic/memory Active Skew Management and Multiple Clock omains L2L_CLK (L2 Cache) LL NCLK (Mem Ctrl) GCLK (U Core) LL PLL LL L2R_CLK (L2 Cache) Widely dispersed drivers LLs compensate static and low frequency variation ivides design and verification effort LL design and verification is added work Tailored clocks SYSCLK Lecture 12:. Clocking Markovic Issues / Slide 14 14
Alpha Processors Case Study EV4 (21064) 0.75µm, 200 MHz ~ 1992 Single global clock driver, 5 levels of buffering 35 cm driver, 3.25 nf, 40% power EV5 (21164) 0.5µm, 300 MHz ~ 1995 One central, two side clock drivers 58 cm driver, 3.75 nf, 40% power EV6 (21264) 0.35µm, 600 MHz ~ 1998 Clock grid, 4 window panes, hierarchical, gated clock domains 40 cm driver, 2.8 nf EV7 0.18µm, 1.2GHz~ 2002 Multiple clock domains, LLs Lecture 12:. Clocking Markovic Issues / Slide 15 15 Intel Processors Pentium II Pentium III Pentium 4 MPR Issue June 1997 April 2000 ec 2001 Clock Speed 266 MHz 1GHz 2GHz Pipeline Stages 12/14 12/14 22/24 Transistors 7.5M 24M 42M Cache (I//L2) 16k/16K/ 16K/16K/256K 12K/8K/256K ie Size 203mm 2 106mm 2 217mm 2 IC Process 0.25µm, 4M 0.18µm, 6M 0.18µm, 6M Max Power 27W 23W 67W Increasing clock speeds and die size Balancing skew using simple RC trees becoming less effective Insertion delay 7 8FO4 due to increased die Clock skew control harder to due to increasing PVT variations Use of active deskewing circuits Lecture 12:. Clocking Markovic Issues / Slide 16 16
IA 32 Pentium Pro: Active eskewing Ext FB CLK Gen elay Line elay SR eskew Control elay Line elay SR Core Lef ft Spine P Rig ght Spine Clock distribution network with deskewing circuit [Geannopoulos and ai 1998] Lecture 12:. Clocking Markovic Issues / Slide 17 17 IA 32 Pentium Pro: elay Shift Register In Load<1:15,2> elay Line Load<0:14,2> Out <1:15,2> <0:14,2> elay shift register [Geannopoulos and ai 1998] elay Shift Register Lecture 12:. Clocking Markovic Issues / Slide 18 18
IA 32 Pentium Pro: Phase etector Right Bandwidth Control elay = n Phase etector 1 Left ftleads Left elay = n Phase etector 2 Right Leads Phase detector [Geannopoulos and ai 1998] Lecture 12:. Clocking Markovic Issues / Slide 19 19 First IA 64 Processor: Clock istribution PLL RCs PLL Clock distribution topology [Rusu and Tam 2000] Core Clock Reference Clock eskew Cluster PLL generates internal clock 2x frequency distribution architecture Balanced global clock tree Multiple deskew buffers Multiple local clock buffers Lecture 12:. Clocking Markovic Issues / Slide 20 20
Result of Active eskewing Measured regional clock skew [Rusu and Tam 2000] Lecture 12:. Clocking Markovic Issues / Slide 21 21 Some Energy Reduction Ideas Clock gating Global Local Low swing clocking Reduced swing drivers CSE redesign N only CSE w/ low swing ual edge triggering Latch mux Pulsed latch Lecture 12:. Clocking Markovic Issues / Slide 22 22
Global Clock Gating Used to save clocking energy when data activity is low (a) In 0 1 S (b) In Load Time mux (no gating) REG EN REG Global Gating (a) Nongated clock circuit, (b) gated clock circuit. [Kitahara et al. 1998] Lecture 12:. Clocking Markovic Issues / Slide 23 23 Local Clock Gating (Requires FF Redesign) Same concept, control at a fine level of granularity (single FF) Pulse Generator I M Clock Control P 1 ata-transition look-ahead latch [Nogawa and Ohtomo, 1998] ata Transition Look Ahead Lecture 12:. Clocking Markovic Issues / Slide 24 24
Local Clock Gating, Another Example Clock enabled only when * * S S' R' R N N 1 Conditional-capture FF [Kong et al. 2000] S' R R' S Lecture 12:. Clocking Markovic Issues / Slide 25 25 Gated Transmission Gate M S Latch Concept: inhibit clock when = comp= XNOR comp = 0 ( ), M S Latch comp = 1 ( = ), = 0, 1 = 1 latches closed, no output change, no internal power S M 1 comp 1 M 0.5 * 0.5 0.5 0.5 * 1 S S comp S 0.5 0.5 S S S Gated M-S latch [Markovic et al. 2001] Lecture 12:. Clocking Markovic Issues / Slide 26 26
Gated TG MS Latch: Timing and Energy Setup (U) and Hold time (H) Energy Increased t setup in gated MSL due to inclusion of the comparator into the critical path slower than conventional MSL Smaller energy per transition if switching activity of is <0.3 For higher α, comparator and gen dominate the energy Lecture 12:. Clocking Markovic Issues / Slide 27 27 ata Transition Look Ahead (TLA) Pulsed Latch Pulsed latch, pulses are gated with XOR TLA circuit If P 1 = 0, circuit operates as a conventional pulsed latch If = P 1 = 1 = 0, no change or energy consumption XOR and Control in the critical path large t setup, and t M Pulse Generator I Clock Control P 1 [Nogawa and Ohtomo, 1998] ata Transition Look Ahead Lecture 12:. Clocking Markovic Issues / Slide 28 28
TLA L: Analysis of Energy Consumption M TLA L without clock gating Pulse Generator I C in (CC) TLA L Pulse Generator α 1 α E = ( E + E ) + E + E CMSL 01 10 Cin 2 2 E L FF α = ( E 2 0 1 + E 1 0 1 α ) + E 2 idle = E idle + ECLK + EL CC + E E E 0 1 + int ext 1 0 = E idle + ECLK + EG + Eint E + E N PG E idle = E0 0 = E1 1 = + E Cin PG shared among N TLA L s α input switching activity Lecture 12:. Clocking Markovic Issues / Slide 29 29 Energy Comparison: TLA L vs. Conventional M S TLA L is more energy efficient than CMSL when N > 2 and α < 0.25 E(TLA L) < E(CMSL) E(TLA L) > E(CMSL) Lecture 12:. Clocking Markovic Issues / Slide 30 30
Clock on emand (CO) Pulsed Latch Pulsed latch, pulses gated with XNOR TLA ckt If XNOR=0, 1 when, and 0 after has changed to If = XNOR=1 =0, no change or energy consumption ata Transition Look Ahead XNOR Pulse Generator includes clock control Cannot be shared among multiple PL s Pulse Generator [Hamada et al., 1999] Lecture 12:. Clocking Markovic Issues / Slide 31 31 Energy Efficient Pulse Generator in CO PL XNOR 1 "0 C int "1 " "1" Straightforward realization with CMOS gates C int switches in each cycle Energy inefficient XNOR "0 inv inv XNOR Compound AN NOR "1" Compound AN NOR gate Energyefficient Lecture 12:. Clocking Markovic Issues / Slide 32 32
Circuit Sizing Impacts Energy Efficiency of CO PL CO PL more effective for high speed due to large transistors [Markovic et al., 2001] Lecture 12:. Clocking Markovic Issues / Slide 33 33 Conditional Capture Flip Flop (CCFF) First stage: pulse generator with internal clock gating When =1, S=R=1 When =1, 1=0, S can switch low if =1, =0, R can switch low if =0, =1 Otherwise, S=R=1 no energy consumed Second stage: pass gate implementation of SAFF No t setup degradation due to clock gating S * * S R R N N 1 S R R S [Kong et al. 2000] Lecture 12:. Clocking Markovic Issues / Slide 34 34
Timing Comparison elay Race Immunity (R) esigns with gating / elay: Relative delay constant w/ V Large delay due to long t setup Internal race immunity: R(FF) < R(MSL) < R(Gated MSL) CO PL bad (wide pulse) [Markovic et al. 2001] Lecture 12:. Clocking Markovic Issues / Slide 35 35 Low Swing Clocking: river Re design Half swing clock drivers: 50% power reduction (minus some penalty in clock drivers) V T C p1 B C p2 C A V V thp T B GN CNT C n1 CNB C n2 H V C B V thn GN CNT CNB [Kojima at al. 1995] Lecture 12:. Clocking Markovic Issues / Slide 36 36
N only Clocked Latches Bring clock only to n MOS transistors to allow reduced clock swing without conflict with partially turned off p MOS transistors Reduced clocking energy with some penalty in performance is in the critical path S SM M N PL (c) SS d 1 N 1 N 2 S N MSL (a) N FF (b) N PPL (d) Pulse Generator (b) (d) (a) conventional TG MSL, (b) pulsed latch, (c) conventional PL, (d) push pull PL Lecture 12:. Clocking Markovic Issues / Slide 37 37 Low Swing FFs: Energy elay Comparison Full swing: PL preferred for high speed MSL preferred for low energy Low swing clock: N FF preferred for high speed N PPL is preferred for low energy 130nm technology, C L = 50fF, C in 12.5fF, α = 0.1 cycle (norm) Energy / / cycle (norm) Energy 1.2 1.0 0.8 0.6 0.4 0.2 0.00 0.8 0.6 0.4 0.2 0 PL N-PPL N-FF N-PPL N-FF MSL N-PL High-Vdd Low-Swing NMSL N-MSL 0 1 2 3 4 5 6 ata-to- delay (FO3 inverter delay) [Markovic et al. 2004] Lecture 12:. Clocking Markovic Issues / Slide 38 38
Effect of Clock Noise on Latch elay All latches fail for clock noise > 12% of clock voltage N FF gives best clock noise rejection delay degradation - 20% 16% 12% 8% 4% 0% N-CL N-PPL N-FF 0% 3% 6% 9% 12% Noise on low-swing clock Lecture 12:. Clocking Markovic Issues / Slide 39 39 ual Edge Triggering: Latch Mux Saves clocking energy regardless of data activity! Extra Mux longer delay C 0 1 S C Lecture 12:. Clocking Markovic Issues / Slide 40 40
ET Latch Mux: Circuit Example Pass gate latches One transparent when = 0 One transparent when = 1 Pass gate multiplexer l that t selects the output t of the opaque latch [Llopis and Sachdev, 1996] Lecture 12:. Clocking Markovic Issues / Slide 41 41 C 2 MOS Latch Mux (C 2 MOS LM) C 2 MOS latches One transparent when = 1 One transparent when = 0 N4 N2 Multiplexer: two C 2 MOS inverters that propagate the output of the opaque latch Large clock transistors shared between the latches and the multiplexer N5 N2 N3 N4 [Gago et al., 1993] N3 N5 Lecture 12:. Clocking Markovic Issues / Slide 42 42
ual Edge Triggering: Pulsed Latch First stage: pulse generator Second stage: capturing latch Pulse Gen C C Pulse Gen C Lecture 12:. Clocking Markovic Issues / Slide 43 43 ET Pulsed Latch Circuit Example 2 2 1 1 1 1 1 1 2 2 2 1 2 1 (a) Pulse generator transparent to only when = 1 = 1, or when = 2 = 1 shortly after both edges of the clock ET PL consumes lot of energy for four clocked pass gates To improve speed, modified from original design (Strollo et al., 1999) which implemented n MOS pass gate and p MOS keeper Lecture 12:. Clocking Markovic Issues / Slide 44 44 (b)
ual Edge Triggered Flip Flop Based on SR latch S C R S CL C R Lecture 12:. Clocking Markovic Issues / Slide 45 45 ET Flip Flop Circuit Example Two pulse generators: X active at rising edge of the clock, Y active at falling edge of the S X and S Y alternately l precharge / evaluate At any moment, one of S X and S Y keeps the value of data sampled at the most recent clock edge The other S X or S Y is precharged high 1 1 st STAGE: X 2 nd STAGE 1 st STAGE: Y S X S Y 1 1 2 2 Pulses at S X and S Y have same width as clock Second stage is a simple NAN gate no need for a latch Lecture 12:. Clocking Markovic Issues / Slide 46 46
SET vs. ET: elay Comparison 6 5 SE E elay [F FO4] 4 3 2 1 0 MSL/LM C2MOS-LM PL SPGFF Latch MUX s have two equally critical paths, somewhat shorter than that of MSL PL is more complex, adding more capacitance to the critical path compared to SET PL SPGFF has short domino like critical path fastest Lecture 12:. Clocking Markovic Issues / Slide 47 47 SET vs. ET: Power Consumption LM s benefit from clever implementation of latch mux structure with clock transistors sharing 180 160 140 120 Non-clk PL adds extra highactivity capacitance compared to SET PL Power [uw] 100 80 60 40 SPGFF power consumption is in the 20 middle, mainly due to 0 alternate switching of nodes SX and SY M SL LM PL SE PL E C2M OS SE C2M OS E SPGFF (0.18 µm, 500MHz for SET, 250MHZ for ET, high load) Lecture 12:. Clocking Markovic Issues / Slide 48 48
SET vs. ET: EP Comparison fj/250mhz] EP [fj/500mhz], [f 60 50 40 30 20 10 0 Single Edge ouble Edge M SL/LM C2M OS PL SPGFF Latch MUX s have similar il or better EP than their SET peers PL exhibits worse delay and energy compared to SET PL, due to more complex design SPGFF is fastest with moderate energy consumption: lowest EP EP (SPGFF) < EP (LM) < EP (PL) Lecture 12:. Clocking Markovic Issues / Slide 49 49 Clock istribution for ET H tree clock distribution with L levels of clock buffers Example below: L = 3 s = Storage Element s/2 L 1 s PLL s/2 L 1 M/4 L 1 storage elements Lecture 12:. Clocking Markovic Issues / Slide 50 50
SET vs. ET: Clocking Power lk (SET) P (ET) / P Cl 1.2 1.0 08 0.8 0.6 0.4 SET better ET better 0.2 1 2 3 0.0 0 0.3 0.6 0.9 1.2 1.5 C CSE,SE / C wire L ET wins if the total clock load capacitance is < 2x the capacitance of a SET design Lecture 12:. Clocking Markovic Issues / Slide 51 51