Lecture 23 Dealing with Interconnect Impact of Interconnect Parasitics Reduce Reliability Affect Performance Classes of Parasitics Capacitive Resistive Inductive 1
INTERCONNECT Dealing with Capacitance Capacitive Crosstalk Dynamic Node CLK C XY Y In 1 In 2 In 3 PDN C Y X 2.5 V CLK 0 V 3 x 1 µm overlap: 0.19 V disturbance 2
Capacitive Crosstalk Driven Node X 0.5 0.45 0.4 0.35 R C τ Y XY = R Y (C XY +C Y ) XY V 0.3 X Y t r C 0.25 Y V (Volt) 0.2 0.15 0.1 0.05 Keep time-constant smaller than rise time 0 0 0.2 0.4 0.6 0.8 1 t(sec) x 10-9 Delay Degradation C c - Impact of neighboring signal activity on switching delay - When neighboring lines switch in opposite direction of victim line, delay increases Miller Effect - Both terminals of capacitor are switched in opposite directions (0 V dd,v dd 0) - Effective voltage is doubled and additional charge is needed (from Q=CV) 3
Interconnect Projections Low-k dielectrics Both delay and power are reduced by dropping interconnect capacitance Types of low-k materials include: inorganic (SiO 2 ), organic (Polyimides) and aerogels (ultra low-k) The numbers below are on the conservative side of the NRTS roadmap ε Generation 0.25 µm 0.18 µm 0.13 µm 0.1 µm 0.07 µm 0.05 µm Dielectric Constant 3.3 2.7 2.3 2.0 1.8 1.5 How to Battle Capacitive Crosstalk Shielding wire GND GND Shielding layer Avoid large crosstalk cap s Avoid floating nodes Isolate sensitive nodes Control rise/fall times Shield! Differential signalling Substrate (GND) 4
Structured and Predictable Interconnect V S G S V S S S V G S V Example: Dense Wire Fabric (DWF) [Khatri, DAC99] Trade-off: Cross-coupling capacitance 40x lower, 2% delay variation Increase in area and overall capacitance Driving Large Capacitances t phl = C L V swing /2 I av V in C L V out Transistor Sizing 5
Using Cascaded Buffers In 1 u u 2 u N-1 Out C i C 1 C 2 C L If self-loading ignored u opt = e t p in function of u and x 60.0 u/ln(u) 40.0 x=10,000 x=1000 20.0 x=100 x=10 0.0 1.0 3.0 5.0 7.0 u 6
How to Design Large Transistors D(rain) Multiple Contacts S(ource) G(ate) small transistors in parallel Bonding Pad Design Bonding Pad GND 100 µm Out In GND Out 7
Reducing the swing t phl = C L V swing /2 I av Reducing the swing potentially yields linear reduction in delay Also results in reduction in power dissipation Requires use of sense amplifier to restore signal level Charge Redistribution Amplifier 5.0 M2 M3 V ref V A C A M1 V B C B V 4.0 3.0 2.0 V in V A V B (a) 1.0 V ref = 3V 0.0 0.0 1.00 2.00 3.00 time (nsec) 8
Precharged Bus In1.f f M1 M2 In2.f Bus C bus M4 M3 Out C out 5.0 f 3.0 V 1.0 V asym V sym V bus -1.0 0 5 10 t (nsec) C bus =1pF Tristate Buffers En In En Out Out En In En 9
INTERCONNECT Dealing with Resistance RI Introduced Noise I φ pre R - V X V I R V 10
Power and Ground Distribution GND Logic Logic GND (a) Finger-shaped network GND (b) Network with multiple supply pins Resistance and the Power Distribution Problem Before After Requires fast and accurate peak current prediction Heavily influenced by packaging technology Source: Simplex 11
Electromigration (1) Limits dc-current to 1 ma/µm Electromigration (2) 12
The Impact of Resistivity T r The distributed rc-line R 1 R 2 R N-1 R N C 1 C 2 C N-1 C N V in 2.5 2.5 Diffused signal propagation Delay ~ L 2 voltage (V) voltage (V) 2 2 1.5 1.5 1 1 0.5 0.5 x= L/10 x= L/10 x = L/4 x = L/4 x = L/2 x = L/2 x= L x= L 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0.5 1.5 2.5 3.5 4.5 0 1 2 3 4 5 time (nsec) time (nsec) d The Global Wire Problem w w ( R C + R C R C ) T = 0.377R C + 0.693 + Challenges No further improvements to be expected after the introduction of Copper (superconducting, optical?) Design solutions» Use of fat wires» Insert repeaters but might become prohibitive (power, area)» Efficient chip floorplanning Towards communication-based design» How to deal with latency?» Is synchronicity an absolute necessity? d out d w w out 13
Reducing RC-delay Repeater Architecture Must Evolve to Fit the Landscape Global operations Low bandwidth High latency & High power 20 Clocks 90,000 tracks Local, parallel operations High bandwidth Low latency & Low power Source: Bill Dally, Stanford 14
Interconnect: # of Wiring Layers # of metal layers is steadily increasing due to: Tins ρ = 2.2 µω-cm M6 Increasing die size and device count: we need more wires and longer wires to connect everything W S M5 Rising need for a hierarchical wiring network; local wires with high density and global wires with low RC M4 H M3 M2 M1 poly substrate 0.25 µm wiring stack 3.5 Minimum Widths (Relative) 4.0 Minimum Spacing (Relative) 3.0 3.5 2.5 3.0 2.5 2.0 M5 1.5 M4 2.0 M3 1.5 1.0 M2 M1 1.0 0.5 Poly 0.5 0.0 0.0 1.0µ 0.8µ 0.6µ 0.35µ 0.25µ 1.0µ 0.8µ 0.6µ 0.35µ 0.25µ M5 M4 M3 M2 M1 Poly Interconnect Projections: Copper Copper is planned in full sub-0.25 µm process flows and large-scale designs (IBM, Motorola, IEDM97) With cladding and other effects, Cu ~ 2.2 µω-cm vs. 3.5 for Al(Cu) 40% reduction in resistance Electromigration improvement; 100X longer lifetime (IBM, IEDM97)» Electromigration is a limiting factor beyond 0.18 µm if Al is used (HP, IEDM95) Vias 15
Clock Distribution CLOCK H-Tree Network Observe: Only Relative Skew is Important Clock Network with Distributed Buffering Local Area Module Module secondary clock drivers Module Module Module Module main clock driver CLOCK Reduces absolute delay, and makes Power-Down easier Sensitive to variations in Buffer Delay 16
Example: DEC Alpha 21164 Clock Frequency: 300 MHz - 9.3 Million Transistors Total Clock Load: 3.75 nf Power in Clock Distribution network : 20 W (out of 50) Uses Two Level Clock Distribution: Single 6-stage driver at center of chip Secondary buffers drive left and right side clock grid in Metal3 and Metal4 Total driver size: 58 cm! 21164 Clocking t rise = 0.35ns t cycle = 3.3ns Clock waveform final drivers pre-driver Location of clock driver on die t skew = 150ps 2 phase single wire clock, distributed globally 2 distributed driver channels» Reduced RC delay/skew» Improved thermal distribution» 3.75nF clock load» 58 cm final driver width Local inverters for latching Conditional clocks in caches to reduce power More complex race checking Device variation 17
Clock Drivers Clock Skew in Alpha Processor 18
EV6 Clocking t cycle = 1.67ns t rise = 0.35ns Global clock waveform t skew = 50ps PLL 2 Phase, with multiple conditional buffered clocks» 2.8 nf clock load» 40 cm final driver width Local clocks can be gated off to save power Reduced load/skew Reduced thermal issues Multiple clocks complicate race checking EV6 Clock Results ps 5 10 15 20 25 30 35 40 45 50 ps 300 305 310 315 320 325 330 335 340 345 GCLK Skew (at Vdd/2 Crossings) GCLK Rise Times (20% to 80% Extrapolated to 0% to 100%) 19
EV7 Clock Hierarchy Active Skew Management and Multiple Clock Domains NCLK (Mem Ctrl) + widely dispersed drivers DLL DLL DLL + DLLs compensate static and lowfrequency variation + divides design and verification effort L2L_CLK (L2 Cache) GCLK (CPU Core) PLL L2R_CLK (L2 Cache) - DLL design and verification is added work SYSCLK + tailored clocks 20