Digital Integrated Circuits: A Design Perspective
Jan M. Rabaey, Anantha Chandrakasan, Borivoje Nikolić
Timing Issues
January 2003
Synchronous Timing
[Figure: In -> register R1 -> combinational logic (C_in to C_out) -> register R2 -> Out; both registers are clocked by CLK.]
Timing Definitions
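Latch Parameters
[Figure: latch with inputs D and Clk and output Q. Waveforms mark the clock period T, the pulse width PW_m, t_su and t_hold on D, and the output delays t_c-q and t_d-q.]
Delays can be different for rising and falling data transitions.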
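Register Parameters
[Figure: register with inputs D and Clk and output Q. Waveforms mark the clock period T, t_su and t_hold on D, and the output delay t_c-q.]
Delays can be different for rising and falling data transitions.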
Clock Uncertainties
Sources of clock uncertainty:
1) Clock Generation
2) Devices
3) Interconnect
4) Power Supply
5) Temperature
6) Capacitive Load
7) Coupling to Adjacent Lines
Clock Nonidealities
- Clock skew: spatial variation in temporally equivalent clock edges; deterministic + random, t_SK
- Clock jitter: temporal variations in consecutive edges of the clock signal; modulation + random noise
  - Cycle-to-cycle (short-term): t_JS
  - Long-term: t_JL
- Variation of the pulse width: important for level-sensitive clocking
Clock Skew and Jitter
[Figure: two Clk waveforms marking the skew t_SK and the jitter t_JS.]
- Both skew and jitter affect the effective cycle time
- Only skew affects the race margin
Clock Skew
[Figure: histogram of the number of registers versus Clk delay. The earliest occurrence of the Clk edge is at nominal - δ/2 and the latest at nominal + δ/2; the insertion delay and the max Clk skew δ are marked.]
Positive and Negative Skew
[Figure: In -> R1 -> combinational logic -> R2 -> combinational logic -> R3, with the registers clocked at t_CLK1, t_CLK2, t_CLK3 and a delay on the clock wire between neighboring registers. (a) Positive skew: CLK enters at the R1 side. (b) Negative skew: CLK enters at the R3 side.]
Positive Skew
[Figure: CLK1 (edges 1, 3) and CLK2 (edges 2, 4) with period T_CLK; CLK2 lags CLK1 by δ. The intervals T_CLK + δ and δ + t_h are marked.]
Launching edge arrives before the receiving edge.
Negative Skew
[Figure: CLK1 (edges 1, 3) and CLK2 (edges 2, 4) with period T_CLK; CLK2 leads CLK1 by δ. The interval T_CLK + δ is marked.]
Receiving edge arrives before the launching edge.
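Timing Constraints
[Figure: In -> register R1 -> combinational logic -> register R2, clocked at t_CLK1 and t_CLK2. Register parameters: t_c-q, t_c-q,cd, t_su, t_hold. Logic parameters: t_logic, t_logic,cd.]
Minimum cycle time: $T - \delta = t_{c-q} + t_{su} + t_{logic}$
Worst case is when the receiving edge arrives early; δ here is the amount by which it arrives early.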
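The cycle-time bound can be turned into a quick calculation. The following is a minimal sketch, not part of the original slides: the function name, argument names, and delay values are hypothetical, and delta is taken as the amount by which the receiving edge arrives early.

```python
def min_cycle_time(t_cq, t_logic, t_su, delta):
    """Smallest T that satisfies T - delta = t_cq + t_su + t_logic.

    delta is the amount (>= 0) by which the receiving clock edge
    arrives early, which is the worst case for the setup constraint.
    """
    return t_cq + t_logic + t_su + delta


# Hypothetical delays in nanoseconds, chosen only for illustration.
T_min = min_cycle_time(t_cq=0.2, t_logic=1.5, t_su=0.1, delta=0.05)
print(f"minimum cycle time: {T_min:.2f} ns")
```

Every term adds directly to the period, so skew in the unfavorable direction costs clock frequency one for one.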
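Timing Constraints
[Figure: In -> register R1 -> combinational logic -> register R2, clocked at t_CLK1 and t_CLK2. Register parameters: t_c-q, t_c-q,cd, t_su, t_hold. Logic parameters: t_logic, t_logic,cd.]
Hold time constraint: $t_{(c-q,cd)} + t_{(logic,cd)} > t_{hold} + \delta$
Worst case is when the receiving edge arrives late; δ here is the amount by which it arrives late.
Race between data and clock.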
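The hold constraint can be checked the same way. This is a hedged sketch with hypothetical names and numbers; delta is the amount by which the receiving edge arrives late.

```python
def hold_margin(t_cq_cd, t_logic_cd, t_hold, delta):
    """Race margin: (t_cq_cd + t_logic_cd) - (t_hold + delta).

    A positive result means the hold constraint is met.
    """
    return (t_cq_cd + t_logic_cd) - (t_hold + delta)


# Hypothetical contamination delays in nanoseconds.
margin = hold_margin(t_cq_cd=0.1, t_logic_cd=0.05, t_hold=0.08, delta=0.05)
print("hold met" if margin > 0 else "hold violated", f"({margin:+.3f} ns)")
```

The cycle time T does not appear in the expression, so a hold violation cannot be repaired by slowing the clock.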
Impact of Jitter
[Figure: CLK waveform with period T_CLK and edges 1 to 6, each edge varying between -t_jitter and +t_jitter. In -> REGS (clocked by CLK; t_c-q, t_c-q,cd, t_su, t_hold, t_jitter) -> combinational logic (t_logic, t_logic,cd).]
Longest Logic Path in Edge-Triggered Systems
[Figure: one clock period T, running from the latest point of launching to the earliest arrival of the next cycle, divided into T_Clk-Q, T_LM, T_SU, and T_JI + δ.]
Clock Constraints in Edge-Triggered Systems
If the launching edge is late and the receiving edge is early, the data will not be too late if:
$T_{c-q} + T_{LM} + T_{SU} < T - T_{JI,1} - T_{JI,2} - \delta$
Minimum cycle time is determined by the maximum delays through the logic:
$T_{c-q} + T_{LM} + T_{SU} + \delta + 2T_{JI} < T$
Skew can be either positive or negative.
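Shortest Path
[Figure: starting from the earliest point of launching, the data passes through T_Clk-Q and T_Lm; at the receiving register the hold time T_H follows the nominal clock edge.]
Data must not arrive before this time.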
Clock Constraints in Edge-Triggered Systems
If the launching edge is early and the receiving edge is late, the data must not arrive before the hold time has elapsed:
$T_{c-q} + T_{Lm} - T_{JI,1} > T_H + T_{JI,2} + \delta$
Minimum logic delay:
$T_{c-q} + T_{Lm} > T_H + 2T_{JI} + \delta$
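The longest-path and shortest-path constraints with skew and jitter can be checked together. The sketch below is illustrative only: the class, field names, and picosecond values are hypothetical, the same clock-to-Q delay is assumed for both paths, and skew is treated as a worst-case magnitude.

```python
from dataclasses import dataclass


@dataclass
class PathTiming:
    t_cq: int      # clock-to-Q delay, T_c-q (ps)
    t_su: int      # setup time, T_SU (ps)
    t_h: int       # hold time, T_H (ps)
    t_lm_max: int  # longest logic delay, T_LM (ps)
    t_lm_min: int  # shortest logic delay, T_Lm (ps)
    skew: int      # worst-case skew magnitude, delta (ps)
    jitter: int    # jitter per edge, T_JI (ps)


def min_period(p: PathTiming) -> int:
    """Lower bound on T from T_c-q + T_LM + T_SU + delta + 2*T_JI < T."""
    return p.t_cq + p.t_lm_max + p.t_su + p.skew + 2 * p.jitter


def hold_ok(p: PathTiming) -> bool:
    """True if T_c-q + T_Lm > T_H + 2*T_JI + delta."""
    return p.t_cq + p.t_lm_min > p.t_h + 2 * p.jitter + p.skew


# Hypothetical numbers in picoseconds.
path = PathTiming(t_cq=100, t_su=50, t_h=40, t_lm_max=800,
                  t_lm_min=60, skew=50, jitter=20)
print(min_period(path), hold_ok(path))
```

Jitter enters both bounds twice because it can move the launching and the receiving edge independently.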
How to counter Clock Skew?
[Figure: data and clock routing among registers (REG) clocked by φ, from In through logic to Out; paths are labeled Negative Skew or Positive Skew according to the direction of the clock distribution.]
Flip-Flop-Based Timing
[Figure: flip-flop followed by logic, clocked by φ. Over one cycle (φ = 1, φ = 0) the timeline shows the flip-flop delay T_Clk-Q, the logic delay, T_SU, and the skew.]
Representation after M. Horowitz, VLSI Circuits 1996.
Flip-Flops and Dynamic Logic
[Figure: two timelines over φ = 0 and φ = 1, each showing T_Clk-Q, logic delay, and T_SU; the dynamic logic alternates between Precharge and Evaluate.]
Flip-flops are used only with static logic.
Latch timing
- When data arrives at a transparent latch, the delay is t_D-Q: the latch is a soft barrier
- When data arrives at a closed latch, the delay is t_Clk-Q: the data has to be re-launched
Single-Phase Clock with Latches
[Figure: latch and logic clocked by φ. The Clk waveform marks the period P, the pulse width PW, and the skews T_skl and T_skt on its two edges.]
Latch-Based Design
- L1 latch is transparent when φ = 0
- L2 latch is transparent when φ = 1
[Figure: L1 latch -> logic -> L2 latch -> logic, clocked by φ.]
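Slack-borrowing
[Figure: In -> L1 latch -> CLB_A -> L2 latch -> CLB_B -> L1 latch, with nodes a to e, block delays t_pd,A and t_pd,B, and clocks CLK1, CLK2, CLK1. The timing diagram over T_CLK (edges 1 to 4) shows when a, b, c, d, and e become valid, the latch delays t_DQ, and the slack passed to the next stage.]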
Latch-Based Timing
[Figure: static logic between the L1 latch and the L2 latch, clocked by φ. The timeline over φ = 1 and φ = 0 shows a long path, a short path, and the skew.]
Can tolerate skew!
Clock Distribution
[Figure: H-tree driven by CLK.]
Clock is distributed in a tree-like fashion.
More realistic H-tree [Restle98]
The Grid System
[Figure: a clock grid with four GCLK drivers around it.]
- No RC matching
- Large power
Example: DEC Alpha 21164
- Clock frequency: 300 MHz; 9.3 million transistors
- Total clock load: 3.75 nF
- Power in clock distribution network: 20 W (out of 50 W)
- Uses two-level clock distribution:
  - Single 6-stage driver at center of chip
  - Secondary buffers drive left and right side clock grid in Metal3 and Metal4
- Total driver size: 58 cm!
21164 Clocking
[Figure: clock waveform with t_cycle = 3.3 ns, t_rise = 0.35 ns, t_skew = 150 ps; location of the clock driver on the die, showing the pre-driver and the final drivers.]
- 2-phase single-wire clock, distributed globally
- 2 distributed driver channels
  - Reduced RC delay/skew
  - Improved thermal distribution
  - 3.75 nF clock load
  - 58 cm final driver width
- Local inverters for latching
- Conditional clocks in caches to reduce power
- More complex race checking
- Device variation
Clock Drivers [figure not recoverable]
Clock Skew in Alpha Processor [figure not recoverable]
EV6 (Alpha 21264) Clocking
600 MHz, 0.35 micron CMOS
[Figure: global clock waveform with t_cycle = 1.67 ns, t_rise = 0.35 ns, t_skew = 50 ps; die plan showing the PLL.]
- 2-phase, with multiple conditional buffered clocks
  - 2.8 nF clock load
  - 40 cm final driver width
- Local clocks can be gated off to save power
- Reduced load/skew
- Reduced thermal issues
- Multiple clocks complicate race checking
21264 Clocking [figure not recoverable]
EV6 Clock Results
[Figure: two maps. GCLK skew (at Vdd/2 crossings), scale 5 to 50 ps. GCLK rise times (20% to 80%, extrapolated to 0% to 100%), scale 300 to 345 ps.]
EV7 Clock Hierarchy
Active Skew Management and Multiple Clock Domains
[Figure: clock hierarchy with SYSCLK, a PLL, three DLLs, and the domains GCLK (CPU Core), NCLK (Mem Ctrl), L2L_CLK (L2 Cache), and L2R_CLK (L2 Cache).]
+ widely dispersed drivers
+ DLLs compensate static and low-frequency variation
+ divides design and verification effort
+ tailored clocks
- DLL design and verification is added work
Self-timed and Asynchronous Design
Functions of clock in synchronous design:
1) Acts as completion signal
2) Ensures the correct ordering of events
Truly asynchronous design:
1) Completion is ensured by careful timing analysis
2) Ordering of events is implicit in logic
Self-timed design:
1) Completion ensured by completion signal
2) Ordering imposed by handshaking protocol
Synchronous Pipelined Datapath
[Figure: In -> R1 -> Logic Block #1 -> R2 -> Logic Block #2 -> R3 -> Logic Block #3 -> R4, all registers clocked by CLK. Delays: t_pd,reg for the registers and t_pd1, t_pd2, t_pd3 for the logic blocks.]
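Self-Timed Pipelined Datapath
[Figure: In -> R1 -> F1 -> R2 -> F2 -> R3 -> F3 -> Out, with logic delays t_pF1, t_pF2, t_pF3. Each stage has a handshake block (HS) that exchanges Req and Ack with its neighbors and connects to the Start and Done signals of its logic block.]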
Completion Signal Generation
Using a delay element (e.g., in memories)
[Figure: In -> LOGIC NETWORK -> Out, in parallel with Start -> DELAY MODULE -> Done.]
Completion Signal Generation
Using redundant signal encoding [figure not recoverable]
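Completion Signal in DCVSL
[Figure: DCVSL gate with two pull-down networks (PDN) driven by In1, In2 and their complements, precharged under control of Start; Done is generated from the differential outputs B0 and B1.]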
Self-Timed Adder
[Figure: (a) differential carry generation: Start-controlled carry chains with signals P0 to P3, G0 to G3, K0 to K3 and carries C0 to C4 with their complements. (b) Completion signal: Done is derived under Start from the carries C1 to C4 and their complements.]
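Completion Signal Using Current Sensing
[Figure: Inputs -> input register (Start) -> static CMOS logic -> Output. A current sensor on the logic's ground connection (GND_sense) produces A, a min delay generator driven by Start produces B, and A and B together produce Done. Waveforms of Start, A, B, Done, and Output mark t_delay, t_overlap, t_MDG, t_pd-NOR, and the valid output.]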
Hand-Shaking Protocol
[Figure: (a) sender-receiver configuration: SENDER and RECEIVER connected by Data, Req, and Ack. (b) Timing diagram of the two-phase handshake over cycle 1 and cycle 2: Data changes (1), Req changes (2, sender's action), Ack changes (3, receiver's action).]
Event Logic: The Muller C-Element
(a) Schematic: inputs A and B, output F
(b) Truth table:
  A  B  F_{n+1}
  0  0  0
  0  1  F_n
  1  0  F_n
  1  1  1
[Figure: three implementations: (a) logic, using a latch with S, R, Q; (b) majority function; (c) dynamic.]
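The truth table can be captured in a few lines. This is a minimal behavioral sketch (function names are hypothetical, signals are 0/1 integers) that also checks the equivalence with the majority-function implementation named on the slide.

```python
def c_element(a: int, b: int, f_prev: int) -> int:
    """Next output F_{n+1}: follow the inputs when they agree, else hold F_n."""
    return a if a == b else f_prev


def c_element_majority(a: int, b: int, f_prev: int) -> int:
    """Same behavior written as the majority of (A, B, F_n)."""
    return (a & b) | (a & f_prev) | (b & f_prev)


# Exhaustive check that both descriptions agree.
for a in (0, 1):
    for b in (0, 1):
        for f in (0, 1):
            assert c_element(a, b, f) == c_element_majority(a, b, f)

# The output changes only after both inputs have changed.
f = 0
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    f = c_element(a, b, f)
    print(a, b, f)
```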
2-Phase Handshake Protocol
[Figure: sender logic and receiver logic exchange Data; the handshake logic, built around a C-element, connects Data ready and Data accepted to Req and Ack.]
- Advantage: FAST, minimal number of signaling events (important for global interconnect)
- Disadvantage: edge-sensitive, has state
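Example: Self-timed FIFO
[Figure: In -> R1 -> R2 -> R3 -> Out, each register with En and Done; a chain of C-elements links Req_i and Ack_i at the input side to Req_o and Ack_o at the output side.]
- All 1s or 0s -> pipeline empty
- Alternating 1s and 0s -> pipeline full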
2-Phase Protocol [figure not recoverable]
Example From [Horowitz] [four figure-only slides, not recoverable]
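4-Phase Handshake Protocol
[Figure: timing diagram over cycle 1 and cycle 2: Data changes (1), Req rises (2) and falls (4) as the sender's action, Ack rises (3) and falls (5) as the receiver's action.]
- Also known as RTZ (return-to-zero)
- Slower, but unambiguous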
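The five numbered events of one 4-phase transfer can be written out as an ordered trace. The sketch below is illustrative only: the generator name and the tuple layout are hypothetical, and it models the event order rather than any circuit.

```python
def four_phase_transfer(word):
    """Yield (actor, event, req, ack) for one return-to-zero transfer."""
    req, ack = 0, 0
    yield ("sender", f"data {word!r} valid", req, ack)  # event 1
    req = 1
    yield ("sender", "Req rises", req, ack)             # event 2
    ack = 1
    yield ("receiver", "Ack rises", req, ack)           # event 3
    req = 0
    yield ("sender", "Req falls", req, ack)             # event 4
    ack = 0
    yield ("receiver", "Ack falls", req, ack)           # event 5


trace = list(four_phase_transfer(0xA5))
for step in trace:
    print(step)
```

Req and Ack both end at 0, so every transfer starts from the same state; the price is four control transitions per transfer instead of the two used by the 2-phase protocol.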
4-Phase Handshake Protocol
Implementation using Muller-C elements
[Figure: sender logic and receiver logic exchange Data; the handshake logic, built from two C-elements (C, C) and a block labeled S, connects Data ready and Data accepted to Req and Ack.]
Self-Resetting Logic
[Figure: three precharged logic blocks (L1, L2, L3) in series, each with its own completion detection. Inset: a gate with inputs A, B, C, internal node int, output out, supply V_DD, and post-charge logic.]
Clock-Delayed Domino
[Figure: domino gate between V_DD and GND with a pulldown network, data input D1, output Q1 (also D2), clock CLK1, and the delayed clock CLK2 passed to the next stage.]
Asynchronous-Synchronous Interface
[Figure: asynchronous system (f_in) -> synchronization -> synchronous system (f_CLK).]
Synchronizers and Arbiters
- Arbiter: circuit to decide which of 2 events occurred first
- Synchronizer: arbiter with clock φ as one of the inputs
- Problem: the circuit HAS to make a decision in limited time; which decision it makes is not important
- Caveat: it is impossible to ensure correct operation
- But we can decrease the error probability at the expense of delay
A Simple Synchronizer
[Figure: latch with input D, clock CLK, internal node int, inverters I1 and I2, and output Q.]
- Data sampled on rising edge of the clock
- Latch will eventually resolve the signal value, but this might take infinite time!
Synchronizer: Output Trajectories
[Figure: V_out (0.0 to 2.0 V) versus time (0 to 300 ps).]
Single-pole model for a flip-flop.
Mean Time to Failure [slide content not recoverable]
Example
- T_φ = 10 ns = T
- T_signal = 50 ns
- t_r = 1 ns
- τ = 310 ps
- V_IH - V_IL = 1 V (V_DD = 5 V)
- N(T) = 3.9 × 10^-9 errors/sec
- MTF(T) = 2.6 × 10^8 sec = 8.3 years
- MTF(0) = 2.5 μsec
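The numbers in this example can be reproduced with a short calculation. The formula slide did not survive extraction, so the sketch assumes the usual single-pole expression N(T) = (V_IH - V_IL) · t_r · exp(-T/τ) / (V_swing · T_signal · T_φ), with V_swing taken equal to V_DD; the function and parameter names are hypothetical.

```python
import math


def sync_errors_per_second(T, T_phi, T_signal, t_r, tau, v_window, v_swing):
    """Synchronization error rate after waiting a time T (assumed model)."""
    # Probability that a transition is sampled inside the undefined region.
    p_init = (v_window / v_swing) * (t_r / T_signal)
    # Metastability decays exponentially with the waiting time T.
    return p_init * math.exp(-T / tau) / T_phi


# Values from the example slide, in SI units.
params = dict(T_phi=10e-9, T_signal=50e-9, t_r=1e-9, tau=310e-12,
              v_window=1.0, v_swing=5.0)

n_T = sync_errors_per_second(T=10e-9, **params)
n_0 = sync_errors_per_second(T=0.0, **params)
print(f"N(T)   = {n_T:.2e} errors/s")
print(f"MTF(T) = {1 / n_T:.2e} s")
print(f"MTF(0) = {1 / n_0:.2e} s")
```

Under this model, waiting one clock period before using the sampled value moves the mean time to failure from microseconds to years, which is the trade of delay for error probability.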
Influence of Noise
[Figure: probability density p(v) between V_IL and V_IH. The initial distribution is uniform around V_M; after time T it is reduced logarithmically and is still uniform.]
Low-amplitude noise does not influence synchronization behavior.
Typical Synchronizers
[Figure: two circuits with output Q: a 2-phase clocking circuit using φ1 and φ2, and a circuit using a delay line.]
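Cascaded Synchronizers
Cascading reduces the synchronization failure rate, i.e., it increases the mean time to failure (MTF), at the cost of added latency.
[Figure: In -> Sync -> O1 -> Sync -> O2 -> Sync -> Out, all clocked by φ.]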
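Arbiters
[Figure: (a) schematic symbol: arbiter with inputs Req1, Req2 and outputs Ack1, Ack2. (b) Implementation: Req1 and Req2 drive cross-coupled gates with internal nodes A and B, which produce Ack1 and Ack2. (c) Timing diagram of Req1, Req2, A, B, and Ack1 versus t, marking the V_T gap and the metastable interval.]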
PLL-Based Synchronization
[Figure: Chip 1 and Chip 2, each containing a digital system, a PLL, and a clock buffer, exchange Data. A crystal oscillator supplies the reference clock f_crystal (< 200 MHz), and a divider is used with the PLL.]
f_system = N × f_crystal
PLL Block Diagram
[Figure: reference clock -> phase detector (Up, Down) -> charge pump -> loop filter (v_cont) -> VCO -> local clock; a divide-by-N block feeds back to the phase detector. The system clock is also labeled.]
Phase Detector
[Figure: (a) ref and local clock waveforms with the output before filtering. (b) ref, local clock, output, and low-pass filtered output. (c) Transfer characteristic: output (up to V_DD) versus phase error from -180 to 180 degrees.]
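Phase-Frequency Detector
[Figure: (a) schematic: two resettable flip-flops (D, Q, Rst), one clocked by A and producing UP, one clocked by B and producing DN. (b) State transition diagram with three states: (UP = 0, DN = 1), (UP = 0, DN = 0), (UP = 1, DN = 0); A edges move toward UP = 1 and B edges move toward DN = 1. (c) Timing waveforms of A, B, UP, DN.]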
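The three-state behavior in the state transition diagram can be sketched as an event-driven model. This is a hedged illustration, not the slide's circuit: it assumes a rising edge on A sets UP, a rising edge on B sets DN, and the pair is reset as soon as both are set; names are hypothetical.

```python
def pfd_step(state, edge):
    """Advance the detector by one rising edge on 'A' or 'B'.

    state is the pair (UP, DN); the returned pair is the new state.
    """
    up, dn = state
    if edge == "A":
        up = 1
    elif edge == "B":
        dn = 1
    else:
        raise ValueError("edge must be 'A' or 'B'")
    if up and dn:  # both set: the reset clears them at once
        up, dn = 0, 0
    return up, dn


def run(edges, state=(0, 0)):
    """Return the list of states visited for a sequence of edges."""
    states = []
    for edge in edges:
        state = pfd_step(state, edge)
        states.append(state)
    return states


# A runs at twice the frequency of B: UP is asserted, DN never is.
print(run(["A", "A", "B", "A", "A", "B"]))
```

Because the model remembers which input led, its output keeps the same sign for as long as one input is faster than the other, which is what makes the detector sensitive to frequency as well as phase.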
PFD Response to Frequency
[Figure: waveforms of A, B, UP, and DN.]
PFD Phase Transfer Characteristic
[Figure: average (UP - DN), up to V_DD, versus phase error from -2π to 2π.]
Charge Pump
[Figure: switches controlled by UP and DN, connected to V_DD and ground, drive the node labeled "To VCO Control Input".]
PLL Simulation
[Figure: control voltage (0.0 to 1.0 V) versus time (0 to 5 μs), with insets of the ref, div, and vco waveforms.]
Clock Generation using DLLs
[Figure: Delay-Locked Loop (delay line based): f_REF -> phase detector (U, D) -> charge pump -> filter -> delay line (DL) -> f_O. Phase-Locked Loop (VCO-based): f_REF -> PD (U, D) -> CP -> filter -> VCO -> f_O, with a divide-by-N block.]
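Delay Locked Loop
[Figure: (a) block diagram: F_REF passes through the VCDL to F_O; a phase detector measures ΔPH and drives U and D into a charge pump, which sets V_CTRL on capacitor C. (b) Waveforms of REF, OUT, UP, DN. (c) Plots of ΔPH, V_CTRL, and Delay.]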
DLL-Based Clock Distribution
[Figure: GLOBAL CLK feeds two local loops, each with a VCDL, a digital circuit, a phase detector, and a CP/LF block.]