Skew-Tolerant Circuit Design
David Harris
David_Harris@hmc.edu
December 2000
Harvey Mudd College, Claremont, CA
Outline
- Introduction
- Skew-Tolerant Circuits
- Traditional Domino Circuits
- Skew-Tolerant Domino Circuits
- Design Methodology
- Example
Talk based on: Skew-Tolerant Circuit Design, Morgan Kaufmann Publishers, 2001.
Skew-Tolerant Circuit Design David Harris Page 2 of 50
Sequencing Overhead in Flip-Flop Systems
- Ideally, each full clock cycle $T_c$ is available for useful computation.
- But we must keep the token in the current pipestage from catching up with the token in the next.
- The purpose of sequencing elements such as flip-flops is not to remember data but merely to retard fast paths so they don't corrupt tokens further along in the pipeline.
- This inherently retards the slower paths as well.
- The result is sequencing overhead inherent to keeping tokens sequenced.
[Figure: one cycle between two flops, with $T_c$ divided into $\Delta_{CQ}$, $\Delta_{logic}$, and $\Delta_{DC}$]
For flip-flops: $\Delta_{logic} = T_c - \Delta_{CQ} - \Delta_{DC}$, where $\Delta_{CQ}$ is the clock-to-Q delay and $\Delta_{DC}$ is the setup time.
Clock Skew & Time Borrowing
- Clock skew increases sequencing overhead.
  - Data may launch on the latest skewed clock edge.
  - Data must set up before the earliest skewed clock edge.
  - Clock skew directly cuts into the time available for computation.
[Figure: one cycle between two flops, with $T_c$ divided into $\Delta_{CQ}$, $\Delta_{logic}$, $\Delta_{DC}$, and $t_{skew}$]
$\Delta_{logic} = T_c - \Delta_{CQ} - \Delta_{DC} - t_{skew}$
- Any unused time at the end of a cycle is wasted.
- No time borrowing is possible to balance logic between stages.
Cycle Time Trends
- Much of CPU performance improvement comes from higher frequencies.
- Clock rates are increasing faster than simple process improvement allows.
- Fixed sequencing overhead becomes a greater portion of cycle time.
[Figure: four plots covering 1985 to 2000 for the 80386, 80486, Pentium, and Pentium II / III: SpecInt95; clock frequency (MHz); fanout-of-4 (FO4) inverter delay (ps) versus process (2.0 µ to 0.25 µ) at VDD = 5, 3.3, and 2.5; FO4 inverter delays per cycle]
SIA Frequency Roadmap
- The Semiconductor Industry Association has predicted clock frequencies.
- Estimate cycle time in FO4 inverter delays.
- Cycles are getting very short.
- Minimizing sequencing overhead will be ever more important.

Process (µm) | Year | Frequency (MHz) | Cycle time (FO4)
0.25 | 1997 | 750 (600) | 13.3 (16.7)
0.18 | 1999 | 1250 (733) | 11.1 (18.9)
0.15 | 2001 | 1500 (> 1500) | 11.1 (< 11.1)
0.13 | 2003 | 2100 | 9.2
0.10 | 2006 | 3500 | 7.1
0.07 | 2009 | 6000 | 6.0
0.05 | 2012 | 10000 | 5.0
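The cycle-time column of the roadmap can be cross-checked from the process and frequency columns. The sketch below is only a rough check: it assumes an FO4 inverter delay of about 400 ps per µm of feature size, a constant that is not stated on the slide and is inferred from the table itself (100 ps at 0.25 µm). The function name is hypothetical.

```python
# Assumption (not from the slide): FO4 delay scales as roughly 400 ps per
# micron of feature size, the value the roadmap table appears to use.
FO4_PS_PER_UM = 400.0

def cycle_time_fo4(process_um, freq_mhz):
    """Return the clock period expressed in FO4 inverter delays."""
    period_ps = 1e6 / freq_mhz            # a 1 MHz clock has a 1e6 ps period
    fo4_ps = FO4_PS_PER_UM * process_um   # estimated FO4 delay for this process
    return period_ps / fo4_ps

# (process in microns, predicted frequency in MHz) rows from the table
roadmap = [(0.25, 750), (0.18, 1250), (0.15, 1500), (0.13, 2100),
           (0.10, 3500), (0.07, 6000), (0.05, 10000)]

for process_um, freq_mhz in roadmap:
    print(process_um, freq_mhz, round(cycle_time_fo4(process_um, freq_mhz), 1))
```

Under that assumption the rounded results match the tabulated cycle times, which suggests the table was built with a similar scaling rule.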
Outline
- Introduction
- Skew-Tolerant Circuits
- Traditional Domino Circuits
- Skew-Tolerant Domino Circuits
- Design Methodology
- Example
Transparent Latches
- Transparent latch-based systems use a latch in each half-cycle.
- A latch is transparent when its controlling clock is high and opaque when it is low.
- Two latches are required so that a token does not skip ahead to the next pipeline stage.
[Figure: two latches on complementary clocks, one per half-cycle; each is transparent in one half-cycle and opaque in the other, and each adds a $\Delta_{DQ}$ delay]
$\Delta_{logic} = T_c - 2\Delta_{DQ}$
Transparent Latches with Skew
- Flip-flops are sensitive to skew because of the hard edges.
  - Data is not launched until the latest rising edge of the clock.
  - Data must set up before the earliest falling edge of the clock.
  - Overhead would shrink if we could soften an edge.
- Latch-based systems may still use the entire cycle even if there is skew.
[Figure: two latches on complementary clocks with skewed edges, spanning half-cycle 1 and half-cycle 2]
$\Delta_{logic} = T_c - 2\Delta_{DQ}$
Pulsed Latches
- Can we keep tokens in sequence with only one latch per cycle?
- Pulse the latch transparent briefly compared to the delay through the pipestage.
- Case 1: pulse wider than setup time + clock skew. Data arrives while the latch is transparent, so no time is wasted.
- Case 2: pulse narrower than setup time + clock skew. Data may arrive while the latch is opaque, leaving wasted time.
[Figure: pulsed-latch timing over one cycle for both cases, showing the pulse φp of width $t_{pw}$, $\Delta_{DQ}$, $\Delta_{logic}$, $\Delta_{DC}$, $t_{skew}$, and the wasted time in Case 2]
$\Delta_{logic} = T_c - \Delta_{DQ}$ if $t_{pw} \ge \Delta_{DC} + t_{skew}$
$\Delta_{logic} = T_c + t_{pw} - \Delta_{DQ} - \Delta_{DC} - t_{skew}$ otherwise
Min-Delay
- Min-delay (a.k.a. hold time or race problem) is especially insidious.
  - Setup time problems can be fixed by slowing the clock.
  - Hold time problems require redesign of the chip.
- Consider the min-delay constraint for flip-flops with little intervening logic.
[Figure: Flop 1, logic with delay $\delta_{logic}$, Flop 2. The input arrives at Flop 2 after $\delta_{CQ} + \delta_{logic}$; it is safe for the Flop 2 input to change after $t_{skew} + \Delta_{CD}$]
$\delta_{CQ} + \delta_{logic} \ge t_{skew} + \Delta_{CD}$
$\delta_{logic} \ge \Delta_{CD} + t_{skew} - \delta_{CQ}$
Min-Delay (Transparent Latches)
- Latches may use nonoverlapping clocks to reduce min-delay problems.
- But min-delay constraints exist between each pair of latches (twice per cycle).
[Figure: Latch 1, logic with delay $\delta_{logic}$, Latch 2, on complementary clocks separated by $t_{nonoverlap}$. The input arrives at Latch 2 after $\delta_{CQ} + \delta_{logic}$; it is safe for the Latch 2 input to change after $t_{skew} + \Delta_{CD} - t_{nonoverlap}$]
$\delta_{CQ} + \delta_{logic} \ge t_{skew} + \Delta_{CD} - t_{nonoverlap}$
$\delta_{logic} \ge \Delta_{CD} + t_{skew} - \delta_{CQ} - t_{nonoverlap}$ (for paths in each half-cycle)
Min-Delay (Pulsed Latches)
- Pulsed latches with wide pulses are especially susceptible to min-delay problems.
  - Data launches off the earliest rising edge of the pulse.
  - Data must hold until after the latest falling edge of the pulse.
[Figure: Latch 1, logic with delay $\delta_{logic}$, Latch 2, both clocked by pulse φp of width $t_{pw}$. The input arrives at Latch 2 after $\delta_{CQ} + \delta_{logic}$; it is safe for the Latch 2 input to change after $t_{pw} + t_{skew} + \Delta_{CD}$]
$\delta_{logic} \ge t_{pw} + \Delta_{CD} + t_{skew} - \delta_{CQ}$
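The three min-delay constraints (flip-flops, transparent latches with nonoverlapping clocks, and pulsed latches) can be compared side by side. The following is a minimal sketch with made-up delay values in picoseconds; the function and argument names are hypothetical, and only the three inequalities come from the slides.

```python
def min_logic_delay(element, hold, skew, cq_min, t_nonoverlap=0.0, t_pw=0.0):
    """Smallest logic contamination delay that avoids a race.

    hold   : hold time (Delta_CD on the slides)
    cq_min : minimum clock-to-Q delay (delta_CQ on the slides)
    A bound at or below zero means any path is safe, so it is clamped to 0.
    """
    if element == "flip-flop":
        bound = hold + skew - cq_min
    elif element == "transparent":
        bound = hold + skew - cq_min - t_nonoverlap
    elif element == "pulsed":
        bound = t_pw + hold + skew - cq_min
    else:
        raise ValueError(f"unknown element: {element!r}")
    return max(0.0, bound)

# Illustrative numbers only (ps); none of these appear in the talk.
common = dict(hold=30.0, skew=50.0, cq_min=40.0)
print(min_logic_delay("flip-flop", **common))                       # 40.0
print(min_logic_delay("transparent", t_nonoverlap=60.0, **common))  # 0.0
print(min_logic_delay("pulsed", t_pw=150.0, **common))              # 190.0
```

With these numbers the pulse width dominates the pulsed-latch requirement, while the nonoverlap removes the constraint for transparent latches entirely.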
Time Borrowing
- Logic between latches may occupy more than exactly 1/2 cycle.
- Borrow time so long as the setup time at the next latch is met.
- Soft edges provide flexibility to partition and balance logic (unlike flops).
[Figure: latch, combinational logic, latch on complementary clocks; logic in half-cycle 1 extends into half-cycle 2 by up to $t_{borrow}$, leaving $\Delta_{DC}$ and $t_{skew}$ before the end of the $T_c/2$ transparent period]
$t_{borrow} = \frac{T_c}{2} - \Delta_{DC} - t_{skew}$
Time Borrowing with Pulses
- What if the clock is high for less than 1/2 cycle (pulses or nonoverlap)?
- Data must still meet the setup time before the end of the transparency pulse.
[Figure: latch, combinational logic, latch; the transparent period is shortened to $t_{pw}$ or by $t_{nonoverlap}$, and $t_{borrow}$ ends $\Delta_{DC} + t_{skew}$ before the latch closes]
$t_{borrow} = t_{pw} - \Delta_{DC} - t_{skew}$ (pulsed latches)
$t_{borrow} = \frac{T_c}{2} - t_{nonoverlap} - \Delta_{DC} - t_{skew}$ (transparent latches)
Skew-Tolerant Circuits Summary
- Flip-flops:
  - Popular because CAD tools and junior engineers understand them
  - Worst sequencing overhead and no time borrowing
- Transparent latches:
  - Better sequencing overhead
  - Excellent time borrowing
  - Understood by most but not all engineers and CAD tools
- Pulsed latches:
  - Lowest sequencing overhead
  - Can be treated like flip-flops
  - Require thorough min-delay analysis
  - Minimal time borrowing

Element | Sequencing overhead | Time borrowing | Min-delay $\delta_{logic}$
Flip-flop | $\Delta_{CQ} + \Delta_{DC} + t_{skew}$ | 0 | $\Delta_{CD} + t_{skew} - \delta_{CQ}$
Transparent latch | $2\Delta_{DQ}$ | $\frac{T_c}{2} - \Delta_{DC} - t_{skew} - t_{nonoverlap}$ | $\Delta_{CD} + t_{skew} - \delta_{CQ} - t_{nonoverlap}$ (in each half-cycle)
Pulsed latch | $\Delta_{DQ} + \max(0, \Delta_{DC} + t_{skew} - t_{pw})$ | $t_{pw} - \Delta_{DC} - t_{skew}$ | $t_{pw} + \Delta_{CD} + t_{skew} - \delta_{CQ}$
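The overhead and time-borrowing columns of the summary can be evaluated for a concrete design point. This is a sketch under assumed numbers: the delay values below (in ps) are illustrative and are not taken from the talk, and the function names are hypothetical.

```python
def flip_flop(cq, setup, skew):
    """Return (sequencing overhead, time borrowing) for a flip-flop stage."""
    return cq + setup + skew, 0.0

def transparent_latch(t_c, dq, setup, skew, t_nonoverlap=0.0):
    """Return (overhead per cycle, time borrowing) for two-latch pipelines."""
    return 2 * dq, t_c / 2 - setup - skew - t_nonoverlap

def pulsed_latch(dq, setup, skew, t_pw):
    """Return (overhead, time borrowing) for a pulsed latch.

    The borrowing term goes negative when the pulse is narrower than
    setup + skew; that shortfall is the wasted time counted in the overhead.
    """
    return dq + max(0.0, setup + skew - t_pw), t_pw - setup - skew

# Illustrative values only (ps).
T_C, CQ, DQ, SETUP, SKEW, T_PW = 2000.0, 100.0, 100.0, 50.0, 200.0, 300.0

print("flip-flop        ", flip_flop(CQ, SETUP, SKEW))                # (350.0, 0.0)
print("transparent latch", transparent_latch(T_C, DQ, SETUP, SKEW))   # (200.0, 750.0)
print("pulsed latch     ", pulsed_latch(DQ, SETUP, SKEW, T_PW))       # (100.0, 50.0)
```

For these inputs the ordering matches the qualitative summary: flip-flops pay the most overhead, transparent latches borrow the most time, and pulsed latches have the lowest overhead but little borrowing.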
Outline
- Introduction
- Skew-Tolerant Circuits
- Traditional Domino Circuits
- Skew-Tolerant Domino Circuits
- Design Methodology
- Example
Circuit Review
- Static CMOS circuits are slow because fat PMOS devices load the inputs.
- Dynamic gates use precharged circuitry to remove the PMOS from the input.
  - Precharge: φ = 0, output set high through a clocked PMOS transistor
  - Evaluate: φ = 1, output possibly pulls low through the logic transistors
[Figure: two cascaded dynamic NOR4 gates with inputs A, B, C, D and outputs X and Y, clocked by φ; waveforms of φ and Y through precharge, evaluate, and precharge]
Domino Circuits
- Dynamic gate inputs must be monotonically rising during evaluation.
  - Place an inverting static gate between each pair of dynamic gates.
  - A dynamic / static pair is called a domino gate.
[Figure: dynamic gates clocked by φ, each followed by an inverting static gate; one dynamic / static pair is marked as a domino gate]
- Domino gates can be safely cascaded.
- Inputs from static CMOS logic must still be latched to avoid glitches.
- Domino evaluation is 1.5-2x faster than static CMOS.
- The challenge is how to prevent precharge time from impacting the critical path.
  - Traditional domino clocking approaches introduce severe overhead.
  - Skew-tolerant domino circuits can hide this overhead.
Traditional Domino Circuits
- How do we prevent precharge time from appearing in the critical path?
- Hide precharge time by ping-ponging between half-cycles.
  - One half-cycle evaluates while the other precharges.
  - Latches hold results during precharge.
[Figure: domino gates on complementary clocks with a latch at the end of each half-cycle; half-cycle 1 evaluates while half-cycle 2 precharges, then the roles swap; each latch adds $\Delta_{DQ}$]
$\Delta_{logic} = T_c - 2\Delta_{DQ}$
Clock Skew
- Clock skew increases sequencing overhead.
  - Evaluation begins at the latest rising edge.
  - The latch input must set up before the earliest falling edge.
  - Budget clock skew twice each cycle.
[Figure: half-cycle 1 of a traditional domino pipeline; evaluation begins at the latest rising edge and data must set up $\Delta_{DC} + t_{skew}$ before the nominal falling edge at the latch]
$\Delta_{logic} = T_c - 2\Delta_{DC} - 2t_{skew}$
Balancing Logic
- Logic may not exactly fit a half-cycle.
- There is no flexibility to borrow time to balance logic between half-cycles.
[Figure: two half-cycles of traditional domino logic, each ending in a latch; one half-cycle leaves unusable time before its latch]
$\Delta_{logic} = T_c - 2\Delta_{DC} - 2t_{skew} - t_{imbalanced}$
How bad is it?
- Consider a 500 MHz system similar to the Alpha 21164:
  - $T_c$ = 2 ns
  - $t_{skew}$ = 200 ps (reported budget)
  - $\Delta_{DC}$ = 50 ps (optimistic estimate)
[Figure: a 2 ns cycle split into half-cycle 1 and half-cycle 2, each divided into logic, setup, and skew]
- 25% of the cycle is consumed by skew and setup!
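The 25% figure follows directly from the traditional domino budget, which charges setup and skew once per half-cycle. The short sketch below redoes that arithmetic with the slide's numbers; the function name is hypothetical, and the imbalance term is left at zero because the slide does not give a value for it.

```python
def traditional_domino_overhead(setup_ps, skew_ps, t_imbalanced_ps=0.0):
    """Sequencing overhead per cycle: 2*setup + 2*skew + imbalance."""
    return 2 * setup_ps + 2 * skew_ps + t_imbalanced_ps

T_C_PS = 2000.0    # 500 MHz clock
T_SKEW_PS = 200.0  # reported skew budget
SETUP_PS = 50.0    # optimistic latch setup time

overhead_ps = traditional_domino_overhead(SETUP_PS, T_SKEW_PS)
fraction = overhead_ps / T_C_PS

print(f"overhead = {overhead_ps:.0f} ps, {fraction:.0%} of the cycle")
# overhead = 500 ps, 25% of the cycle
```

Any imbalance between the two half-cycles would add to this figure, so 25% is a lower bound for this design point.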
Relaxing the Timing Constraints
- Sequencing overhead is caused by hard edges (just as for flip-flops).
  - Data is not launched until the latest rising edge of the clock.
  - Data must set up before the earliest falling edge of the clock.
  - Overhead would shrink if we could soften an edge.
- Let's look at the role of the latch on the falling edge.
- Latch function:
  (1) Prevent glitches on inputs of domino gates
  (2) Hold results during precharge
- Latches would be unnecessary if we could do these tasks in another way.
  - Glitches will not occur if inputs come from domino logic.
  - Can we build domino circuits that consume the results before the previous half-cycle precharges?
Outline
- Introduction
- Skew-Tolerant Circuits
- Traditional Domino Circuits
- Skew-Tolerant Domino Circuits
- Design Methodology
- Example
Skew-Tolerant Domino Circuits
- Overlap the clocks so gate Y evaluates before gate X precharges.
[Figure: overlapping clocks φ1 and φ2 over $T_c$; a latch feeds the φ1 domino gates of Phase 1, ending in gate X, followed by the φ2 gates of Phase 2, starting with gate Y; $\Delta_{DC} + t_{skew}$ is budgeted at the latch]
- We can remove latches within the domino pipeline.
  - This eliminates all sequencing overhead within the domino pipeline.
  - We still budget skew and setup at the static / domino interface.
Skew-Tolerant Domino Waveforms
- In general, we can use N phases.
  - Each phase is delayed $T_c/N$ from the previous one.
[Figure: four overlapping clocks φ1 to φ4 over $T_c$, each with evaluation time $t_e$ and precharge time $t_p$, adjacent phases overlapping by $t_{overlap}$; domino gates grouped into Phase 1 to Phase 4]
- What is the best duty cycle ($t_e$ and $t_p$)?
Precharge Time ($t_p$)
- Precharge time is set by the precharge of two gates in a phase with skewed clocks.
  - The first gate begins precharging late.
  - It requires $t_{prech}$ to fully precharge.
  - Precharge must complete before the second gate reenters evaluation.
[Figure: two gates in one phase clocked by skewed clocks φ1a and φ1b; precharge interval $t_p$ covers $t_{prech}$ plus $t_{skew}$]
$t_p = t_{prech} + t_{skew}$
Evaluation Time ($t_e$)
- Evaluation time is set by the next phase consuming the result before the first phase precharges.
  - The first phase begins evaluation early.
  - It must not precharge until some hold time $t_{hold}$ after the second phase evaluates.
[Figure: last gate of Phase 1 (clock φ1b) driving first gate of Phase 2 (clock φ2a); evaluation interval $t_e$ covers $T_c/N$, $t_{skew}$, and $t_{hold}$]
$t_e = \frac{T_c}{N} + t_{hold} + t_{skew}$
Skew Tolerance
- We've found the best duty cycle:
  - Precharge: $t_p = t_{prech} + t_{skew}$
  - Evaluation: $t_e = \frac{T_c}{N} + t_{hold} + t_{skew}$
- Solve for the maximum tolerable skew:
  $t_{skew\text{-}max} = \frac{\frac{N-1}{N} T_c - t_{prech} - t_{hold}}{2}$
- Observations:
  - Skew tolerance increases with N.
  - Precharge and hold time cut into skew tolerance.
  - In the limit of large N and long cycles, skew tolerance approaches 1/2 cycle.
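The maximum skew follows from requiring the evaluation and precharge intervals to fill exactly one cycle, $t_e + t_p = T_c$, which is how the waveform slide draws them. The sketch below evaluates the result for a few phase counts; the 2 ns cycle, 500 ps precharge time, and zero hold time are illustrative inputs, and the function names are hypothetical.

```python
def max_skew(n_phases, t_c, t_prech, t_hold):
    """Maximum tolerable skew: ((N-1)/N * T_c - t_prech - t_hold) / 2."""
    return ((n_phases - 1) / n_phases * t_c - t_prech - t_hold) / 2

def duty_cycle(n_phases, t_c, t_prech, t_hold, t_skew):
    """Return (t_e, t_p) for a given skew budget."""
    t_e = t_c / n_phases + t_hold + t_skew
    t_p = t_prech + t_skew
    return t_e, t_p

T_C, T_PRECH, T_HOLD = 2000.0, 500.0, 0.0   # ps, illustrative

for n in (2, 3, 4, 6, 8):
    skew = max_skew(n, T_C, T_PRECH, T_HOLD)
    t_e, t_p = duty_cycle(n, T_C, T_PRECH, T_HOLD, skew)
    # At the maximum skew the two intervals exactly fill the cycle.
    print(n, round(skew), round(t_e + t_p))
```

Skew tolerance grows with N but with diminishing returns, since the $(N-1)/N$ factor saturates at 1.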
Local Skew
- Local clock domains
  - Most circuits experience less skew in a small area than across the chip.
  - Reduce skew within a phase of logic, thereby tolerating more skew globally.
$t_p = t_{prech} + t_{skew}^{local}$
$t_e = \frac{T_c}{N} + t_{hold} + t_{skew}^{global}$
$t_{skew\text{-}max}^{global} = \frac{N-1}{N} T_c - t_{prech} - t_{hold} - t_{skew}^{local}$
- This is the nominal overlap between phases.
Time Borrowing
- Excess overlap allows time borrowing into the next phase (e.g. gate X).
[Figure: overlapping clocks φ1 and φ2 over $T_c$; the overlap $t_{skew\text{-}max}^{global}$ is split into $t_{skew}^{global}$ and $t_{borrow}$, and gate X of Phase 1 evaluates during the borrowed time in Phase 2]
$t_{borrow} = t_{skew\text{-}max}^{global} - t_{skew}^{global} = \frac{N-1}{N} T_c - t_{prech} - t_{hold} - t_{skew}^{local} - t_{skew}^{global}$
Example and Summary
- Again consider a 500 MHz processor:
  - $T_c$ = 2 ns
  - $t_{skew}^{global}$ = 200 ps, $t_{skew}^{local}$ = 50 ps
  - $t_{hold}$ = 0 (conservative)
[Figure: available time borrowing vs. number of clock phases, assuming $t_{prech}$ = 500 ps: $t_{borrow}$ = 250, 583, 750, 917, and 1000 ps for N = 2, 3, 4, 6, and 8]

Domino style | Sequencing overhead | Skew tolerance | Time borrowing
Traditional | $2\Delta_{DC} + 2t_{skew} + t_{imbalanced}$ | n/a | 0
Skew-tolerant | 0 | $\frac{N-1}{N} T_c - t_{prech} - t_{hold} - t_{skew}^{local}$ | $\frac{N-1}{N} T_c - t_{prech} - t_{hold} - t_{skew}^{local} - t_{skew}^{global}$
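The plotted time-borrowing values can be reproduced from the skew-tolerant time-borrowing formula and the example's parameters. The sketch below does that arithmetic; the function name is hypothetical, and all numeric inputs are the ones listed on the slide.

```python
def time_borrowing(n_phases, t_c, t_prech, t_hold, skew_local, skew_global):
    """t_borrow = (N-1)/N * T_c - t_prech - t_hold - local skew - global skew."""
    overlap = (n_phases - 1) / n_phases * t_c - t_prech - t_hold - skew_local
    return overlap - skew_global

# Parameters from the 500 MHz example (all in ps).
params = dict(t_c=2000.0, t_prech=500.0, t_hold=0.0,
              skew_local=50.0, skew_global=200.0)

phases = (2, 3, 4, 6, 8)
borrow = [round(time_borrowing(n, **params)) for n in phases]
print(dict(zip(phases, borrow)))
# {2: 250, 3: 583, 4: 750, 6: 917, 8: 1000}
```

Going from two to four phases triples the available borrowing, while going from four to eight adds only another third, which is consistent with the talk's later recommendation of four phases.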
Outline
- Introduction
- Skew-Tolerant Circuits
- Traditional Domino Circuits
- Skew-Tolerant Domino Circuits
- Design Methodology
- Results
Clock Generation
Two-Phase Clock Generator
[Figure: g drives a low-skew complement generator that produces φ1 and φ2]
Four-Phase Clock Generator
[Figure: g drives a low-skew complement generator and 1/4 cycle delays that produce φ1, φ2, φ3, and φ4, with optional clock choppers]
- Inverters are used to delay the clock by a quarter of the nominal cycle time.
Generator-Induced Clock Skew
- Variation in clock generator delays appears as clock skew.
[Figure: bar chart of overlap (ps, 0 to 750) for 2-phase and 4-phase clocking, divided into complement skew, chopper skew, delay skew, and $t_{borrow}$]
- Variation in delay is 20-30% relative to other clocks.
- 4-phase is still a large net benefit.
Interface with Logic
- Real systems mix both static and domino logic.
  - Transparent latches use φ1 and φ3.
  - Pulsed latches use φp (see book).
  - Four-phase skew-tolerant domino uses φ1, φ2, φ3, and φ4.
  - Which connections are legal?
- Domino outputs must be latched before driving static logic.
  - Use an N-C²MOS latch for speed and to avoid clock skew problems.
[Figure: N-C²MOS latch clocked by φ, with input D from domino logic and output Q to static logic]
Timing Types
- Timing types define legal interfaces between gates.
- Signal suffixes define when data is legal to use and allow static verification.
  - _s indicates that a signal is stable throughout that phase.
  - _v indicates that a signal becomes valid before the end of that phase.
[Figure: waveforms of φ1 to φ4 with signals x_s23 and y_v2; a φ1 latch produces x_s23 and a φ2 domino gate produces y_v2]
4-Phase Skew-Tolerant Domino Timing Types
- Static gate output suffix is the same as the input suffix.
- Clocked gates and latches set the output suffix.

Element type | Clock | Input | Output
Domino | φ1 | _s1, _v4 (first) / _v1 (others) | _v1
Domino | φ2 | _s2, _v1 (first) / _v2 (others) | _v2
Domino | φ3 | _s3, _v2 (first) / _v3 (others) | _v3
Domino | φ4 | _s4, _v3 (first) / _v4 (others) | _v4
Transparent latch | φ1 | _s1 | _s23
Transparent latch | φ3 | _s3 | _s41
N-C²MOS latch | φ2 | _v2 | _s34
N-C²MOS latch | φ4 | _v4 | _s12
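Because the timing-type table is a set of mechanical rules, it lends itself to a static checker. The following is a minimal sketch of one, not the author's tool. It makes two interpretive assumptions that the slide does not spell out: a multi-phase stable type such as _s23 is taken to satisfy a requirement for _s2 or _s3, and "first" means the first domino gate of a phase while "others" means later gates in the same phase. All function and element names are hypothetical.

```python
import re

N_PHASES = 4  # four-phase skew-tolerant domino

def parse_type(suffix):
    """Split a suffix such as '_s23' or 'v1' into (kind, set of phases)."""
    m = re.fullmatch(r"_?([sv])([1-4]+)", suffix)
    if m is None:
        raise ValueError(f"not a timing type: {suffix!r}")
    return m.group(1), {int(d) for d in m.group(2)}

def next_phase(p):
    return p % N_PHASES + 1          # 1->2, 2->3, 3->4, 4->1

def prev_phase(p):
    return (p - 2) % N_PHASES + 1    # 1->4, 2->1, 3->2, 4->3

def accepts(element, phase, suffix, first_in_phase=True):
    """Return True if an element clocked by `phase` may take this input."""
    kind, phases = parse_type(suffix)
    if element == "domino":
        if kind == "s":              # assumption: _s23 counts as stable in 2 and 3
            return phase in phases
        wanted = prev_phase(phase) if first_in_phase else phase
        return phases == {wanted}
    if element == "transparent_latch":   # phi1 or phi3, needs a stable input
        return phase in (1, 3) and kind == "s" and phase in phases
    if element == "nc2mos_latch":        # phi2 or phi4, needs _v of its own phase
        return phase in (2, 4) and kind == "v" and phases == {phase}
    raise ValueError(f"unknown element: {element!r}")

def output_type(element, phase):
    """Timing type produced by a clocked element, following the table."""
    if element == "domino":
        return f"_v{phase}"
    if element in ("transparent_latch", "nc2mos_latch"):
        a = next_phase(phase)            # latch output is stable for the
        return f"_s{a}{next_phase(a)}"   # two phases after its clock phase
    raise ValueError(f"unknown element: {element!r}")

print(accepts("domino", 1, "_v4", first_in_phase=True))    # True
print(accepts("domino", 1, "_v4", first_in_phase=False))   # False
print(output_type("nc2mos_latch", 2))                      # _s34
```

Static gates are omitted because they pass their input suffix through unchanged, so a checker only needs to propagate the type across them.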
Example of Legal Connections
[Figure: schematic of legal connections among φ1 and φ3 transparent latches, φ2 and φ4 N-C²MOS latches, static gates, and four-phase domino gates (φ1 to φ4), with each signal annotated by its timing type: s23, s3, s41, s34, s12, and v1 to v4]
Example of Illegal Connections
[Figure: schematic of illegal connections among φ1 and φ3 transparent latches, static gates, and four-phase domino gates (φ1 to φ4), with each signal annotated by its timing type: s23, s3, s41, and v1 to v4]
Testability
- To test chips, we would like to be able to stop the clock and to observe and control one element in each path through each cycle of logic.
- Stop with g low (φ1 and φ2 low, φ3 and φ4 high).
- Add a full keeper to the first φ3 gate of each cycle to keep it from floating high or low.
  - This differs from the half keeper of ordinary dynamic gates, which only float high.
[Figure: signal x_v2 from φ2 domino logic drives a φ3 domino gate with a weak full keeper on output y_v3; waveforms of φ2, φ3, x_v2, and y_v3 mark where the output would otherwise float]
Transparent Latch Scan
- Add a slave latch and access transistor to scan φ1 static latches.
[Figure: φ1 transparent latch (D to Q) with a slave latch, scan clocks SCA and SCB, scan data in SDI, and scan data out SDO]
Scan Procedure:
1. Stop g low
2. Toggle SCA and SCB to march data through the scan chain
3. Restart g
- How do we scan cycles of domino logic with no transparent latches?
Domino Scan
- Scanning domino logic is very similar.
- Which gate should be scanned?
  - The next gate should be precharging so glitches do not propagate.
  - When normal operation resumes, scan data should remain until consumed.
  - Thus, choose the last φ4 domino gate of each cycle for scan.
[Figure: scannable φ4 domino gate with pulldown network, clock φ4s, slave latch, scan clocks SCA and SCB, scan data in SDI, and scan data out SDO]
Scan Procedure:
1. Stop g low
2. Stop φ4s low
3. Toggle SCA and SCB to march data through the scan chain
4. Restart g
5. Release φ4s once the scannable gate begins precharge (race)
Improved Clock Generator
- Putting together skew-tolerant domino, stop clock, and scan:
[Figure: clock generator with enable en and input g producing φ1, φ2, φ3, φ4, and φ4s; an SR latch set and reset by SCA and φ4 controls φ4s]
- The SR latch solves the race in reenabling φ4s.
1. Stop g low.
2. Toggle SCA and SCB to march data through the scan chain. The first pulse of SCA will force φ4s low.
3. Restart g. The falling edge of φ4 will release φ4s to track φ4.
Delayed Clocking / Delayed Reset Domino
- An alternative approach uses a single clock phase per gate.
- Locally generate clocks with matched delay buffers.
[Figure: g drives a chain of matched delay buffers that generate φ1 to φ6, one per domino gate; waveforms of φ1 to φ6 each delayed from the previous]
Outline
- Introduction
- Skew-Tolerant Circuits
- Traditional Domino Circuits
- Skew-Tolerant Domino Circuits
- Design Methodology
- Example
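Example
- 64-bit superscalar ALU self-bypass path
[Figure: traditional domino implementation. ALU block with latches on complementary clocks, a 64-bit adder, result mux, and bypass mux (150 ff), wire segments of 1 mm, 1 mm, and 2 mm, x2 and x4 fanouts to the other ALU and to the data cache]
[Figure: skew-tolerant domino implementation. The same ALU block clocked by φ1, φ2, φ3, and φ4 with no latches in the loop: 64-bit adder, result mux, and bypass mux (150 ff), wire segments of 1 mm, 1 mm, and 2 mm, x2 and x4 fanouts to the other ALU and to the data cache]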
Simulation Results
- Adders were simulated in 0.6 µ HP14 technology.
[Figure: bar chart of time (FO4 inverter delays), with no skew and with 1 FO4 of local skew, for traditional latency, traditional cycle time, skew-tolerant latency, and skew-tolerant cycle time; bar labels: 15.0, 13.0, 18.6, 16.6, 11.9, 11.9]
- Skew-tolerant domino is 25%+ faster.
Skew-Tolerant Domino Conclusions
- Domino is very attractive for high-speed ICs.
- Traditional domino is limited by sequencing overhead:
  - Latch delay
  - Clock skew
  - Inability to balance logic through time borrowing
- Skew-tolerant domino eliminates this overhead.
  - Use overlapping clocks and remove latches.
  - Overhead at the static / domino interface motivates building critical loops entirely from domino.
- Domino design issues:
  - Four-phase skew-tolerant domino is a reasonable choice.
  - Timing types specify legal connections among static and dynamic gates.
  - Latchless pipelines can still be fully scanned for testability.
  - Clock generators are simple.
- Now in widespread use among top design teams:
  - DEC Alpha ALUs, Intel OTB Domino, IBM Delayed-Reset Domino, Sun Delayed Clocking, etc.