EE141 Fall 2005
Lecture 18: Adders

Announcements
- Hw 6 due Thursday, Nov 3, 5pm
- No lab this week
- Midterm 2
  - Review: Tue Nov 8, North Gate Hall, Room 105, 6:30-8:30pm
  - Exam: Thu Nov 10, Morgan, Room 101, 6:30-8:00pm
  - Samples available at the class web-site
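
Class Material
Last Lecture
- Dynamic Logic
Today's Lecture
- Dual-Rail Domino, np-CMOS
- Adders

Domino Logic
[Figure: two cascaded domino gates, each a dynamic stage (precharge Mp, PDN, foot Me) followed by a static inverter; the second stage includes a keeper Mkp. Inputs In1-In5, outputs Out1 and Out2. Dynamic nodes: 1->1 or 1->0; outputs: 0->0 or 0->1.]
- Evaluation (conditional discharge)
- ONLY 0->1 transitions during evaluation!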
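
Footless Domino
[Figure: chain of footless domino gates with precharge devices Mp. Inputs In1, In2, In3, ..., Inn are annotated 1->0; outputs Out1, Out2, ..., Outn are annotated 0->1.]
- The first gate in the chain needs a foot switch
- Precharge is rippling (next stage has to wait for propagation delay of inverter from the previous stage)
- Static power consumption

Differential (Dual Rail) Domino
[Figure: dual-rail domino gate with precharge devices Mp, keepers Mkp (annotated off / on) and foot Me; true inputs A, B and complemented inputs; dynamic nodes annotated 1->0.]
$\mathrm{Out} = AB$, $\overline{\mathrm{Out}} = \overline{AB}$
- Solves the problem of non-inverting logic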

np-CMOS
[Figure: an n-type dynamic stage (Mp, PDN, Me; inputs In1-In3) with output Out1 feeding a p-type dynamic stage (Me, PUN, Mp; inputs In4, In5) with output Out2 (to PDN). Out1: 1->1 or 1->0; Out2: 0->0 or 0->1.]
- Only 0->1 transitions allowed at inputs of PDN
- Only 1->0 transitions allowed at inputs of PUN

NORA Logic
[Figure: the same np-CMOS stages, with additional connections labeled "to other PDNs" and "to other PUNs".]
- WARNING: Very sensitive to noise!
- P-blocks are slower

Choosing a Logic Style: No Style Fits all Needs
General design considerations
- Robustness (Static CMOS, Ratioed Logic)
- Area (Pseudo-NMOS, Static CMOS)
- Speed (Dynamic, Ratioed Logic)
- Power (Static CMOS, Dynamic Logic)
Application-specific considerations
- XOR-dominated functions (PTL)
Design tool considerations
- Static CMOS

Adders
(video clip)

ALUs are Thermal Hotspots!
[Figure: processor thermal map, Temp (°C), showing the cache and the execution core. Courtesy: R. Krishnamurthy (Intel)]
- Execution core: integer and FP ALUs and MACs
- ALUs: performance and peak-current limiters
- Goal: high-performance energy-efficient design

32-Bit ALU Architecture
[Figure: ALU block diagram. Blocks: two 6:1 muxes on the external operands, 5:1 mux, 2:1 mux, adder core, O/p mux (Sum), loopback bus. Control signals: mux control, shift control, sign control. Courtesy: R. Krishnamurthy (Intel)]
- Multiple ALUs clustered together in the execution core
- High power density
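
Full Adder
[Figure: full-adder block with inputs A, B, Cin and outputs Sum, Cout.]

The Binary Adder
[Figure: full-adder block with inputs A, B, Cin and outputs Sum, Cout.]
$S = A \oplus B \oplus C_i = A\,\overline{B}\,\overline{C_i} + \overline{A}\,B\,\overline{C_i} + \overline{A}\,\overline{B}\,C_i + A\,B\,C_i$
$C_o = AB + BC_i + AC_i$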
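
As a quick check of the full-adder behavior, here is a minimal Python sketch. It assumes the standard definitions (sum is the XOR of A, B and the carry-in; carry-out is the majority of the three). The function name `full_adder` is illustrative and not from the slides.

```python
from itertools import product


def full_adder(a, b, ci):
    """One-bit full adder: return (s, co) for inputs a, b, ci in {0, 1}."""
    s = a ^ b ^ ci                          # S = A xor B xor Ci
    co = (a & b) | (b & ci) | (a & ci)      # Co = AB + BCi + ACi
    return s, co


if __name__ == "__main__":
    # Print the truth table as rows (A, B, Ci, S, Co).
    for a, b, ci in product((0, 1), repeat=3):
        print(a, b, ci, *full_adder(a, b, ci))
```

Each row satisfies 2*Co + S = A + B + Ci, which is the arithmetic meaning of the two outputs.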

Express Sum and Carry as a Function of P, G, D
Define 3 new variables which ONLY depend on A, B:
- Generate (G) = $AB$
- Propagate (P) = $A \oplus B$
- Delete (D) = $\overline{A}\,\overline{B}$
Can also derive expressions for S and Co based on D and P.
Note that we will sometimes be using an alternate definition for Propagate (P) = $A + B$.

The Ripple-Carry Adder
[Figure: four full adders (FA) in a chain with inputs (A0, B0) to (A3, B3), carry-in Ci,0, carries Co,0 (= Ci,1), Co,1, Co,2, Co,3 and sums S0 to S3.]
- Worst case delay linear with the number of bits: $t_d = O(N)$
- $t_{adder} = (N-1)\,t_{carry} + t_{sum}$
- Goal: Make the fastest possible carry path circuit
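
A behavioral sketch of the ripple-carry adder built from generate/propagate signals, assuming G = AB, P = A xor B, and the usual forms Co = G + P Ci and S = P xor Ci. The helper names (`to_bits`, `from_bits`, `ripple_carry_add`, `ripple_delay`) are illustrative, and bit lists are LSB first.

```python
def to_bits(x, n):
    """Return the n least-significant bits of x, LSB first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits: LSB-first bit list to integer."""
    return sum(b << i for i, b in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two equal-length LSB-first bit lists; return (sum_bits, carry_out)."""
    s_bits = []
    c = c_in
    for a, b in zip(a_bits, b_bits):
        g = a & b            # generate
        p = a ^ b            # propagate (XOR definition)
        s_bits.append(p ^ c)  # S = P xor Ci
        c = g | (p & c)      # Co = G + P Ci, rippled to the next bit
    return s_bits, c


def ripple_delay(n, t_carry=1.0, t_sum=1.0):
    """Worst-case delay t_adder = (N - 1) t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum


if __name__ == "__main__":
    s, co = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
    print(from_bits(s), co)      # 11 + 6 = 17 = 16 + 1
    print(ripple_delay(4))
```

The loop makes the linear dependence visible: each sum bit needs the carry from the previous iteration, so the carry update is the operation worth speeding up.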
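
Complementary Static CMOS Full Adder
[Figure: transistor-level schematic with internal node X and outputs S, Co.]
28 Transistors

Inversion Property
[Figure: a full adder (FA) with outputs S, Co shown next to an FA with all inputs and outputs inverted; the two are equivalent.]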
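
The inversion property (inverting all three inputs of a full adder inverts both outputs) can be confirmed by exhaustive enumeration. This is a minimal sketch assuming the standard full-adder equations; the function names are illustrative.

```python
from itertools import product


def full_adder(a, b, ci):
    """One-bit full adder: return (s, co)."""
    return a ^ b ^ ci, (a & b) | (b & ci) | (a & ci)


def inversion_property_holds():
    """Check S(!A,!B,!Ci) = !S(A,B,Ci) and Co(!A,!B,!Ci) = !Co(A,B,Ci)."""
    for a, b, ci in product((0, 1), repeat=3):
        s, co = full_adder(a, b, ci)
        s_inv, co_inv = full_adder(1 - a, 1 - b, 1 - ci)
        if s_inv != 1 - s or co_inv != 1 - co:
            return False
    return True


if __name__ == "__main__":
    print(inversion_property_holds())
```

This is what allows alternating cells in a ripple chain to work on inverted carries without an explicit inverter in the carry path.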

Minimize Critical Path by Reducing Inverting Stages
[Figure: 4-bit ripple chain of FA cells alternating even cell / odd cell; inputs (A0, B0) to (A3, B3), carry-in Ci,0, carries Co,0 to Co,3, sums S0 to S3.]
Exploit Inversion Property

A Better Structure: The Mirror Adder
[Figure: mirror adder schematic with branches labeled "0"-Propagate, Kill, "1"-Propagate, Generate; signals Co and S.]
24 Transistors
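
Mirror Adder Stick Diagram
[Figure: stick diagram of the mirror adder; labeled nodes include Co, S and GND.]

Manchester Carry Chain
[Figure: Manchester carry gate in two versions; labels Pi, Gi, Di, Ci, Co and clock φ.]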
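
Manchester Carry Chain
[Figure: 4-bit Manchester carry chain clocked by φ; propagate P0-P3, generate G0-G3, carry-in Ci,0, carry nodes C0-C3.]

Manchester Carry Chain Stick Diagram
[Figure: stick diagram with a Propagate/Generate Row (Pi, Gi, φ, Pi+1, Gi+1, φ), carry nodes for bits i-1 to i+1, GND, and an Inverter/Sum Row.]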

Domino Manchester Carry Chain
[Figure: domino Manchester carry chain with relative transistor sizes annotated (1 to 6); signals P0-P3, G0-G3, Ci,0, Ci,4.]
Chain nodes: $\overline{G_0 + P_0 C_{i,0}}$, $\overline{G_1 + P_1 G_0 + P_1 P_0 C_{i,0}}$

Carry-Bypass Adder
Also called Carry-Skip
[Figure: 4-bit ripple chain of FAs (carry-in Ci,0, per-bit P and G inputs, carries Co,0 to Co,3), shown without and with a bypass multiplexer controlled by BP = P0 P1 P2 P3.]
Idea: If (P0 and P1 and P2 and P3 = 1), then Co,3 = Ci,0, else kill or generate
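
The bypass idea can be sketched behaviorally: when every bit of a block propagates, the block's carry-out is simply its carry-in, so a multiplexer may select the carry-in directly. This sketch assumes the XOR definition of propagate (with P = A + B the shortcut would not be valid as a mux select). The function names are illustrative, and bit lists are LSB first.

```python
def block_ripple(a_bits, b_bits, c_in):
    """Ripple the carry through one block; return (carry_out, block_propagate)."""
    c = c_in
    bp = 1
    for a, b in zip(a_bits, b_bits):
        p = a ^ b            # propagate (XOR definition, assumed here)
        g = a & b            # generate
        c = g | (p & c)
        bp &= p              # BP = P0 P1 ... P(M-1)
    return c, bp


def bypass_carry_out(a_bits, b_bits, c_in):
    """Carry-out of a bypass block: the mux picks c_in when BP = 1."""
    rippled, bp = block_ripple(a_bits, b_bits, c_in)
    return c_in if bp else rippled


if __name__ == "__main__":
    # A = 0101, B = 1010 (LSB first below): every bit propagates.
    print(bypass_carry_out([1, 0, 1, 0], [0, 1, 0, 1], 1))
```

The function returns the same value as plain rippling in every case; the benefit in hardware is only in timing, since the mux path avoids waiting for the carry to traverse the block.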

Carry-Bypass Adder (Cont.)
[Figure: 16-bit carry-bypass adder in four blocks (Bit 0-3, Bit 4-7, Bit 8-11, Bit 12-15), each with Carry propagation and Sum stages; critical path annotated with tsetup, tbypass, tsum; block width M bits.]
$t_{adder} = t_{setup} + M\,t_{carry} + (N/M - 1)\,t_{bypass} + (M-1)\,t_{carry} + t_{sum}$

Carry Ripple vs. Carry Bypass
[Figure: tp versus N for the ripple adder and the bypass adder; the curves cross at N of about 4-8.]
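
To get a feel for the two delay expressions, the sketch below evaluates the ripple formula t_adder = (N-1) t_carry + t_sum and the bypass formula t_adder = t_setup + M t_carry + (N/M - 1) t_bypass + (M-1) t_carry + t_sum. The unit default delays are an assumption for illustration only, so the crossover point they produce need not match the plotted one.

```python
def ripple_delay(n, t_carry=1.0, t_sum=1.0):
    """Ripple-carry adder: (N - 1) t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum


def bypass_delay(n, m, t_setup=1.0, t_carry=1.0, t_bypass=1.0, t_sum=1.0):
    """Carry-bypass adder with N bits in blocks of M bits."""
    if n % m != 0:
        raise ValueError("n must be a multiple of the block size m")
    return (t_setup
            + m * t_carry                  # ripple through the first block
            + (n // m - 1) * t_bypass      # skip the remaining blocks
            + (m - 1) * t_carry            # ripple inside the last block
            + t_sum)


if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64):
        print(n, ripple_delay(n), bypass_delay(n, 4))
```

Both expressions are linear in N, but the bypass slope is t_bypass / M per bit instead of t_carry per bit, which is why the bypass adder wins only beyond some word length.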

Carry-Select Adder
[Figure: one carry-select block. P,G signals feed a "0" carry propagation path (carry-in "0") and a "1" carry propagation path (carry-in "1"); a multiplexer driven by Co,k-1 produces Co,k+3 and the carry vector for sum generation.]

Carry Select Adder: Critical Path
[Figure: 16-bit carry-select adder in four blocks (Bit 0-3, Bit 4-7, Bit 8-11, Bit 12-15), each with a 0-Carry path, a 1-Carry path, a multiplexer and sum generation; carries Ci,0, Co,3, Co,7, Co,11, Co,15; sums S0-3, S4-7, S8-11, S12-15.]

Linear Carry Select
[Figure: 16-bit linear carry-select adder (blocks Bit 0-3, Bit 4-7, Bit 8-11, Bit 12-15) annotated with arrival times in unit delays: setup (1), block carries (5), multiplexer outputs (6), (7), (8), (9), final sum (10).]
$t_{add} = t_{setup} + M\,t_{carry} + (N/M)\,t_{mux} + t_{sum}$

Square Root Carry Select
[Figure: 20-bit square-root carry-select adder with blocks Bit 0-1, Bit 2-4, Bit 5-8, Bit 9-13, Bit 14-19, annotated with arrival times in unit delays: setup (1), block carries (3), (4), (5), (6), (7), multiplexer outputs (4), (5), (6), (7), (8), final sum (9).]
$t_{add} = t_{setup} + M\,t_{carry} + \sqrt{2N}\,t_{mux} + t_{sum}$
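
The arrival-time annotations on the two carry-select figures can be reproduced with a small timing model. This is a sketch under stated assumptions: unit delays for setup, per-bit carry, multiplexer and sum, and closed forms t_add = t_setup + M t_carry + (N/M) t_mux + t_sum (linear) and t_add = t_setup + M t_carry + sqrt(2N) t_mux + t_sum (square root). All function names are illustrative.

```python
import math


def carry_select_arrival(block_sizes, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Arrival time of the last sum for a carry-select adder with given blocks.

    Each block computes its 0-carry and 1-carry chains in parallel, ready at
    t_setup + size * t_carry; its multiplexer fires once both the block
    carries and the previous multiplexer output are available.
    """
    mux_out = None
    for size in block_sizes:
        carry_ready = t_setup + size * t_carry
        if mux_out is None:
            mux_out = carry_ready + t_mux
        else:
            mux_out = max(mux_out, carry_ready) + t_mux
    return mux_out + t_sum


def linear_select_delay(n, m, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Closed form for equal blocks of M bits."""
    return t_setup + m * t_carry + (n / m) * t_mux + t_sum


def sqrt_select_delay(n, m, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Closed-form approximation for progressively growing blocks."""
    return t_setup + m * t_carry + math.sqrt(2 * n) * t_mux + t_sum


if __name__ == "__main__":
    print(carry_select_arrival([4, 4, 4, 4]))      # 16-bit linear
    print(carry_select_arrival([2, 3, 4, 5, 6]))   # 20-bit square root
    print(linear_select_delay(16, 4), sqrt_select_delay(20, 2))
```

With unit delays the model gives 10 for the 16-bit linear arrangement and 9 for the 20-bit square-root arrangement, matching the final annotations on the figures. Growing each block by one bit lets every block's carries arrive just as the previous multiplexer output does, so no stage waits.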

Adder Delays Comparison
[Figure: tp (in unit delays, 0-50) versus N (0-60) for the ripple adder, linear select and square root select.]

Carry Look-Ahead
Partial Sum: $Sum_i = A_i \oplus B_i \oplus Carry_{i-1}$
$Carry_i = A_i B_i + (A_i + B_i)\,Carry_{i-1}$, where $A_i B_i$ is Generate and $(A_i + B_i)$ is Propagate
$Carry_i = G_i + P_i\,Carry_{i-1}$
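
The carry recurrence works with either definition of propagate, which can be checked exhaustively. This minimal sketch compares Carry = G + P Carry_in for P = A + B (OR) and P = A xor B against the majority function; the function names are illustrative.

```python
from itertools import product


def carry_with_or(a, b, c_in):
    """Carry_i = A B + (A + B) Carry_{i-1}, propagate defined as OR."""
    return (a & b) | ((a | b) & c_in)


def carry_with_xor(a, b, c_in):
    """Carry_i = G + P Carry_{i-1}, propagate defined as XOR."""
    return (a & b) | ((a ^ b) & c_in)


def definitions_agree():
    """True if both forms equal the majority of (a, b, c_in) for all inputs."""
    for a, b, c in product((0, 1), repeat=3):
        majority = 1 if a + b + c >= 2 else 0
        if carry_with_or(a, b, c) != majority:
            return False
        if carry_with_xor(a, b, c) != majority:
            return False
    return True


if __name__ == "__main__":
    print(definitions_agree())
```

The two forms differ only when A = B = 1, where the generate term already forces the carry to 1. The XOR form is still required wherever P is used to compute the sum.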

Look-Ahead: Basic Idea
[Figure: N bit slices with inputs (A0, B0), (A1, B1), ..., (AN-1, BN-1), carry inputs Ci,0, Ci,1, ..., Ci,N-1, propagate signals P0, P1, ..., PN-1 and sums S0, S1, ..., SN-1.]
The idea is to eliminate carry rippling effect
$C_{o,k} = f(A_k, B_k, C_{o,k-1}) = G_k + P_k C_{o,k-1}$

Look-Ahead: Topology
Expanding Look-ahead equations:
$C_{o,k} = G_k + P_k (G_{k-1} + P_{k-1} C_{o,k-2})$
All the way:
$C_{o,k} = G_k + P_k (G_{k-1} + P_{k-1} (\cdots + P_1 (G_0 + P_0 C_{i,0})))$
[Figure: single-gate implementation of Co,3 with inputs G0-G3, P0-P3 and Ci,0.]
Implementation issues:
- long stack (N+1)
- or multiple stages
still linear delay!

Logarithmic Look-Ahead Adder
[Figure: inputs A0-A7 combined by a linear chain of F blocks ($t_p \sim N$) versus a tree of F blocks ($t_p \sim \log_2(N)$).]
Idea:
- large stacks limit carry look-ahead to 2-4 bits
- organize carry P and G into recursive trees

Carry Look-Ahead Trees
$C_{o,0} = G_0 + P_0 C_{i,0}$
$C_{o,1} = G_1 + P_1 G_0 + P_1 P_0 C_{i,0}$
$C_{o,2} = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_{i,0}$
$\phantom{C_{o,2}} = (G_2 + P_2 G_1) + (P_2 P_1)(G_0 + P_0 C_{i,0}) = G_{2:1} + P_{2:1} C_{o,0}$
Can continue building the tree hierarchically.
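
The regrouping of the Co,2 expression into a group generate G2:1 and group propagate P2:1 can be verified by brute force. The sketch below assumes the usual merge rule for adjacent groups, (G, P) = (G_hi + P_hi G_lo, P_hi P_lo); the function names are illustrative.

```python
from itertools import product


def merge(hi, lo):
    """Combine (G, P) of a higher group with (G, P) of the adjacent lower group."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def co2_flat(g, p, ci):
    """Fully expanded Co,2 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 Ci,0."""
    return (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
            | (p[2] & p[1] & p[0] & ci))


def co2_grouped(g, p, ci):
    """Co,2 = G2:1 + P2:1 Co,0 with Co,0 = G0 + P0 Ci,0."""
    g21, p21 = merge((g[2], p[2]), (g[1], p[1]))
    co0 = g[0] | (p[0] & ci)
    return g21 | (p21 & co0)


def grouping_is_exact():
    """Compare the two forms for every Boolean assignment."""
    for bits in product((0, 1), repeat=7):
        g, p, ci = bits[0:3], bits[3:6], bits[6]
        if co2_flat(g, p, ci) != co2_grouped(g, p, ci):
            return False
    return True


if __name__ == "__main__":
    print(grouping_is_exact())
```

Because `merge` is associative, group signals can be combined in any bracketing, which is what permits a tree of depth log2(N) instead of a chain.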

High-Performance Adders: Kogge-Stone Tree Adder
[Figure: 32-bit Kogge-Stone datapath split into even and odd input bits: PG Gen., carry-merge stages CM1-CM5, XOR, Sum (even / odd); gate stages numbered 1-7. Courtesy: R. Krishnamurthy (Intel)]
Carry-merge: $GG = G_i + P_i G_{i-1}$, $GP = P_i P_{i-1}$
- Generate all 32 carries
- Full-blown binary tree: energy-inefficient
- # carry-merge stages = $\log_2(32)$ = 5 stages

Kogge-Stone Adder
[Figure: 32-bit Kogge-Stone tree, bits 31 down to 0: PG row, carry-merge gates, XOR row. Courtesy: R. Krishnamurthy (Intel)]
- Critical path = PG + 5 + XOR = 7 gate stages
- Generate, Propagate fanout (FO) of 2, 3
- Maximum interconnect spans 16b
- Energy inefficient
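
A behavioral sketch of a radix-2 Kogge-Stone adder follows. It models only the logic, not the domino circuits or the even/odd split, and assumes the carry-merge rule GG = G_hi + P_hi G_lo, GP = P_hi P_lo with the merge span doubling at each stage. The function name and the way the carry-in is folded into bit 0 are illustrative choices, not taken from the slides.

```python
def kogge_stone_add(a, b, n=32, c_in=0):
    """Add two n-bit integers; return (sum mod 2**n, carry_out, merge_stages)."""
    # PG generation: per-bit generate and propagate (XOR definition).
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Group signals; fold the carry-in into bit 0 (illustrative choice).
    G = g[:]
    P = p[:]
    G[0] = g[0] | (p[0] & c_in)

    # Carry-merge stages: bit i merges with the group ending at bit i - span.
    stages = 0
    span = 1
    while span < n:
        G_next, P_next = G[:], P[:]
        for i in range(span, n):
            G_next[i] = G[i] | (P[i] & G[i - span])
            P_next[i] = P[i] & P[i - span]
        G, P = G_next, P_next
        span *= 2
        stages += 1

    # After the last stage, G[i] is the carry out of bit i.
    carries_in = [c_in] + G[:-1]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries_in[i]) << i   # final XOR row
    return total, G[n - 1], stages


if __name__ == "__main__":
    print(kogge_stone_add(123456789, 987654321))
    print(kogge_stone_add(0xFFFFFFFF, 1))
```

For n = 32 the loop runs five merge stages, consistent with log2(32) = 5. The inner loop touches nearly every bit at every stage, which reflects the full-blown tree's gate count and wiring cost.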

Tree Adders
[Figure: 16-bit radix-2 Kogge-Stone tree with inputs (A0, B0) to (A15, B15) and sums S0 to S15.]

Example: Domino Adder
[Figure: domino Propagate and Generate gates clocked by Clk, with inputs ai, bi.]
$P_i = a_i + b_i$ (Propagate), $G_i = a_i b_i$ (Generate)
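
Example: Domino Adder
The dot operator (carry-merge)
[Figure: domino carry-merge gates clocked by Clk_k. Propagate gate: inputs P_{i:i-k+1}, P_{i-k:i-2k+1}, output P_{i:i-2k+1}. Generate gate: inputs G_{i:i-k+1}, P_{i:i-k+1}, G_{i-k:i-2k+1}, output G_{i:i-2k+1}.]

Example: Domino Sum
[Figure: domino sum circuit with a keeper; signals Clk, Clkd, Gi:0, conditional sums Si0 and Si1, output Sum.]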

Tree Adders
[Figure: 16-bit radix-4 Kogge-Stone tree with inputs (a0, b0) to (a15, b15) and sums S0 to S15.]

Sparse-Tree Adder Architecture
[Figure: sparse-tree adder. Courtesy: R. Krishnamurthy (Intel)]
- Generate every 4th carry in parallel
- Side-path: 4-bit conditional sum generator
- 73% fewer carry-merge gates: energy-efficient

Adder Core Critical Path
[Figure: critical path from the adder inputs to Sum31. Single-rail dynamic sparse-tree path: PG, GG1, GG3, GG7, GG15, GG27 (carry C27), with clocks clk, clk2, clk3 and carry-merge gates CM0, CM1. Static sum generator: latch, XOR, conditional sums Sum31_0 and Sum31_1. Courtesy: R. Krishnamurthy (Intel)]
- Critical path: 7 gates, same as KS
- Sparse-tree: single-rail dynamic
- Exploit non-criticality of sum generator
- Convert to static logic: semi-dynamic design

Sparse-Tree Architecture
Performance impact: 20% speedup
- 33-50% reduced G/P fanouts
- 80% reduced wiring complexity
- 30% reduction in maximum interconnect
Power impact: 56% reduction
- 73% fewer carry-merge gates
- 50% reduction in average transistor size
Courtesy: R. Krishnamurthy (Intel)

Energy-Delay Space
[Figure: worst-case energy (pJ, 0-100) versus delay (ps, 140-280) in 130nm CMOS, 1.2V, 110°C, comparing Dynamic Kogge-Stone with Semi-dynamic Sparse-Tree; the 4GHz design point is marked, with annotations 56% and 20%. Courtesy: R. Krishnamurthy (Intel)]
- 20% speedup over Kogge-Stone
- 56% worst-case energy reduction

Next Lecture
- Multipliers
- Power