, Mark Horowitz 1, & Dean Liu 1 David_Harris@hmc.edu, {horowitz, dliu}@vlsi.stanford.edu March, 1999 Harvey Mudd College Claremont, CA 1 (with Stanford University, Stanford, CA)
Outline Introduction Timing Analysis Formulation Timing Verification Algorithm Results Conclusion
Introduction Clock skew, as a fraction of the cycle time, is a growing problem for fast chips Fewer gate delays per cycle Poor transistor length, threshold tolerances Larger clock loads Bigger dice
Introduction Clock skew, as a fraction of the cycle time, is a growing problem for fast chips Fewer gate delays per cycle Poor transistor length, threshold tolerances Larger clock loads Bigger dice The designer may: Reduce skew Very hard; clock networks are already well optimized
Introduction Clock skew, as a fraction of the cycle time, is a growing problem for fast chips Fewer gate delays per cycle Poor transistor length, threshold tolerances Larger clock loads Bigger dice The designer may: Reduce skew Very hard; clock networks are already well optimized Tolerate skew Flip-flops and traditional domino circuits reduce cycle time by skew es and skew-tolerant domino can hide modest amounts of skew
Introduction Clock skew, as a fraction of the cycle time, is a growing problem for fast chips Fewer gate delays per cycle Poor transistor length, threshold tolerances Larger clock loads Bigger dice The designer may: Reduce skew Very hard; clock networks are already well optimized Tolerate skew Flip-flops and traditional domino circuits reduce cycle time by skew es and skew-tolerant domino can hide modest amounts of skew Only budget necessary skews Skew between nearby latches is often much less than skew across die Need better timing analysis for different skews between different latches
Timing Analysis Formulation Build on Sakallah, Mudge, Olukotun (SMO) analysis of latch-based systems. System contains: k clocks C = {, φ 2,..., φ k ) l latches L = {L 1, L 2,..., L l } φ 2 φ 2 L 1 L 2 L 3
Clock Waveforms 0 T c φ 2 φ 2 L 1 L 2 L 3 T c : cycle time
Clock Waveforms 0 T c Tφ1 = T c /2 T φ2 = T c /2 φ 2 φ 2 L 1 L 2 L 3 T c T φi : cycle time : duration for which φ i is high
Clock Waveforms 0 T T φ1 = T c /2 c φ 2 s φ1 = 0 s φ2 = T c /2 T φ2 = T c /2 φ 2 L 1 L 2 L 3 T c T φi s φi : cycle time : duration for which φ i is high : start time, relative to beginning of common clock, of φ i being high
Clock Waveforms 0 T T φ1 = T c /2 c φ 2 s φ1 = 0 s φ2 = T c /2 T φ2 = T c /2 S φ1φ2 = S φ2φ1 = -T c /2 φ 2 L 1 L 2 L 3 T c T φi s φi S φi φ j : cycle time : duration for which φ i is high : start time, relative to beginning of common clock, of φ i being high : phase shift from φ i to next occurrence of φ j. Used to translate times relative to particular clock phases.
Variables p i : clock phase controlling latch i φ 2 φ 2 L 1 L 2 L 3 p 1 : 1 p 2 : 2 p 3 : 1
Variables p i : clock phase controlling latch i DCi : setup time for latch i φ 2 φ 2 L 1 L 2 L 3 p 1 : 1 p 2 : 2 p 3 : 1
Variables p i : clock phase controlling latch i DCi DQi : setup time for latch i : propagation delay through latch i φ 2 φ 2 L 1 L 2 L 3 p 1 : 1 p 2 : 2 p 3 : 1
Variables p i : clock phase controlling latch i DCi DQi Assume: : setup time for latch i : propagation delay through latch i T c DQ = 1000 ps = 80 ps φ 2 0 500 1000 φ 2 L 1 L 2 L 3 12 = 670 23 =70 p 1 : 1 p 2 : 2 p 3 : 1 ij : propagation delay through logic between latches i and j
Variables p i : clock phase controlling latch i DCi DQi Assume: : setup time for latch i : propagation delay through latch i T c DQ = 1000 ps = 80 ps φ 2 0 500 1000 φ 2 L 1 L 2 L 3 12 = 670 23 =70 p 1 : 1 p 2 : 2 p 3 : 1 A 1 : 0 ps A 2 : 250 A 3 : -100 ij A i : propagation delay through logic between latches i and j : arrival time at latch i, relative to start of p i
Variables p i : clock phase controlling latch i DCi DQi Assume: : setup time for latch i : propagation delay through latch i T c DQ = 1000 ps = 80 ps φ 2 0 500 1000 φ 2 L 1 L 2 L 3 12 = 670 23 =70 p 1 : 1 p 2 : 2 p 3 : 1 A 1 : 0 ps A 2 : 250 A 3 : -100 D 1 : 0 D 2 : 250 D 3 : 0 ij A i D i : propagation delay through logic between latches i and j : arrival time at latch i, relative to start of p i : departure time from latch i
Variables p i : clock phase controlling latch i DCi DQi Assume: : setup time for latch i : propagation delay through latch i T c DQ = 1000 ps = 80 ps φ 2 0 500 1000 φ 2 L 1 L 2 L 3 12 = 670 23 =70 p 1 : 1 p 2 : 2 p 3 : 1 A 1 : 0 ps A 2 : 250 A 3 : -100 D 1 : 0 D 2 : 250 D 3 : 0 Q 1 : 80 Q 2 : 330 Q 3 : 80 ij A i D i Q i : propagation delay through logic between latches i and j : arrival time at latch i, relative to start of : departure time from latch i : output time of latch i p i
Timing Constraints Departure: i L D i = max( 0A, i )
Timing Constraints Departure: i L D i = max( 0A, i ) Output: i L Q i = D i + DQi
Timing Constraints Departure: i L D i = max( 0A, i ) Output: i L Q i = D i + DQi Arrival: ij, L A i = max( Q j + ji + S pj p ) i
Timing Constraints Departure: i L D i = max( 0A, i ) Output: i L Q i = D i + DQi Arrival: ij, L A i = max( Q j + ji + S pj p ) i Propagation Constraints: ij, L D i = max( 0max, ( D j + DQj + ji + S pj p ) i
Timing Constraints Departure: i L D i = max( 0A, i ) Output: i L Q i = D i + DQi Arrival: ij, L A i = max( Q j + ji + S pj p ) i Propagation Constraints: ij, L D i = max( 0max, ( D j + DQj + ji + S pj p ) i Setup Constraints: i L D i + DCi T pi
Clock skew is the difference between nominal and actual interarrival times of a pair of clocks. Enlarge set of physical clocks C to model skew between nominally identical clocks. Example: φ 2a L 3 D 3 C = {a, φ 2a, b, φ 2b } local t skew within domains a 4 L 4 D 4 b 6 L 6 D 6 global t skew between domains 5 7 φ 2a L 5 D 5 φ 2b L 7 D 7 ALU (clock domain a) Data Cache (clock domain b)
Single Skew Formulation Easy and conservative to budget global skew everywhere Effectively increases setup time at each latch Setup Constraints: global i L D i + DCi + t skew T pi Too conservative for high-speed designs with big global skews
Exact Skew Budgets How much skew must be budgeted? L 3 to L 4 : local skew φ 2a L 3 D 3 4 6 a L 4 D 4 b L 6 D 6 5 7 φ 2a L 5 D 5 φ 2b L 7 D 7 ALU (clock domain a) Data Cache (clock domain b)
Exact Skew Budgets How much skew must be budgeted? L 3 to L 4 : L 7 to L 4 : local skew global skew φ 2a L 3 D 3 4 6 a L 4 D 4 b L 6 D 6 5 7 φ 2a L 5 D 5 φ 2b L 7 D 7 ALU (clock domain a) Data Cache (clock domain b)
Exact Skew Budgets How much skew must be budgeted? L 3 to L 4 : local skew L 7 to L 4 : global skew L 5 to L 4 through transparent L 6, L 7 : local skew Must track launching clock to determine skew budget D 3 φ 2a L 3 4 6 a L 4 D 4 b L 6 D 6 5 7 φ 2a L 5 D 5 φ 2b L 7 D 7 ALU (clock domain a) Data Cache (clock domain b)
Exact Skew Formulation Define arrival and departure times with respect to launching clocks: A i c D i c φ i, φ j t skew : arrival time at latch i for path launched by clock c : departure time from latch i for path launched by clock c : skew between clocks φ i, φ j
Negative Departure Times Must now allow negative departure times with respect to other clocks: Path from L 5 to L 7 is earlier than L 6 to L 7, but sees more skew, miss setup Reaches L 6 at -50 ps, but L 6 may be transparent by then because of skew Departure times w.r.t. latch s own clock still must be nonnegative φ 2a L 3 T c = 1000 ps a φ 2a 4 6 450 ps L 4 b L 6 5 7 950 ps L φ 5 2b L 7 2a D 6 1b D 6 2a D 7 1b D 7 = 50 = 0 = 400 = 450 ALU (clock domain a) Data Cache (clock domain b) 2a, 2b t skew 1b, 2b t skew = = DC = 0 DQ = 0 150 ps 25 ps
Exact Constraints with Skew: Propagation Constraints (single skew): ij, L D i = max( 0max, ( D j + DQj + ji + S pj p ) i Setup Constraints (single skew): global i L D i + DCi + t skew T pi Propagation Constraints (exact skew): ij, Lc, C if c = p i c c then D i = max( 0max, ( D j + DQj + ji + S pj p )) i c c else D i = max( D j + DQj + ji + S pj p ) i Setup Constraints (exact skew): i Lc, C D i c + + DCi cp, i t skew T pi
Other Timing Constraints Flip-flops: No transparency, easier than latches Still budget skew between launching and receiving clocks Min-delay: Only requires checks between consecutive pairs of clocked elements Standard verification algorithms work if proper skew is used
Verification Algorithm Check constraints with generalized Szymanski-Shenoy relaxation algorithm 1 For each latch i: p 2 i max max D i = 0 ; Di = 0; c i = // initialize departure times 3 Enqueue 4 While queue is not empty 5 Dequeue 6 For each latch i in fanout of j c 7 A = D j + + + // calculate arrival time c 8 If A> D i AND c max i, c // is it possibly critical? cp, 9 If ( + + i > ) // does it violate setup time? 10 Report setup time violation 11 Else D i p i D j c DQj ji S pj p i c c 12 D i = A; Enqueue D i // keep following path max max 13 If ( A> D i ) D i A; p i ( ) A t + skew A DCi t skew T pi > D i max max = c i = c
Results Analyzed MAGIC: Memory & General Interconnect Controller of FLASH supercomputer Assume local t skew global = 250ps t skew Model A: As designed, from MAGIC.sdf database Model B: Flops converted to latch pairs, logic balanced between pairs = 500ps
Results Analyzed MAGIC: Memory & General Interconnect Controller of FLASH supercomputer Assume local t skew global = 250ps t skew Model A: As designed, from MAGIC.sdf database Model B: Flops converted to latch pairs, logic balanced between pairs CPU time < 1 second in all cases = 500ps Model A # Flip-Flops 10559 0 Model B # es 1819 22937 Single Skew T c 9.43 ns 8.05 ns # Departures Checked 3866 24995 Exact Skew T c 9.38 7.96 # Departures Checked 4009 25328
Conclusions Global skews will be too large for GHz + systems Use skew-tolerant circuit techniques such as latches Take advantage of smaller local skews where possible Requires support of timing analyzer Budget appropriate skew at each receiver Track departure times with respect to launching clocks Allow negative departure times with respect to other clocks Leads to explosion in number of timing constraints. However... Most are not tight because most critical paths do not borrow time across many latches Relaxation algorithm automatically prunes loose constraints Very small increase in runtime Expect synchronous systems well beyond 1 GHz