VLSI Design. [Adapted from Rabaey's Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI Design

VLSI Design: Adder Design

Major Components of a Computer
- Processor: Control, Datapath
- Memory
- Devices: Input, Output

A Generic Digital Processor
[Figure: block diagram with MEMORY, INPUT-OUTPUT, CONTROL, and DATAPATH blocks]

Basic Building Blocks
Datapath
- Execution units: adder, multiplier, divider, shifter, etc.
- Register file and pipeline registers
- Multiplexers, decoders
Control
- Finite state machines (PLA, ROM, random logic)
Interconnect
- Switches, arbiters, buses
Memory
- Caches (SRAMs), TLBs, DRAMs, buffers

Bit-Sliced Design
[Figure: four bit slices (Bit 3 to Bit 0), each passing data-in through a register, adder, shifter, and multiplexer to data-out, with shared control lines]
Tile identical processing elements.


The 1-bit Binary Adder
[Figure: 1-bit full adder (FA) with inputs A, B, C_in and outputs S, C_out]
A  B  C_in | C_out  S | carry status
0  0  0    | 0      0 | kill
0  0  1    | 0      1 | kill
0  1  0    | 0      1 | propagate
0  1  1    | 1      0 | propagate
1  0  0    | 0      1 | propagate
1  0  1    | 1      0 | propagate
1  1  0    | 1      0 | generate
1  1  1    | 1      1 | generate
$G = AB$, $P = A \oplus B$, $K = \overline{A}\,\overline{B}$
$S = A \oplus B \oplus C_{in} = P \oplus C_{in}$
$C_{out} = AB + AC_{in} + BC_{in}$ (majority function) $= G + PC_{in}$
- How can we use it to build a 64-bit adder?
- How can we modify it easily to build an adder/subtractor?
- How can we make it better (faster, lower power, smaller)?
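The truth table and the generate/propagate equations on this slide can be checked with a short behavioral model. This is only a sketch: the function names `full_adder` and `carry_status` are illustrative and do not come from the slides.

```python
from itertools import product

def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Return (sum, carry_out) for one bit using G and P."""
    g = a & b                 # generate: carry out is 1 regardless of c_in
    p = a ^ b                 # propagate: carry out follows c_in
    s = p ^ c_in              # S = P xor C_in
    c_out = g | (p & c_in)    # C_out = G + P*C_in
    return s, c_out

def carry_status(a: int, b: int) -> str:
    """Classify an input pair as in the 'carry status' column."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"

for a, b, c_in in product((0, 1), repeat=3):
    s, c_out = full_adder(a, b, c_in)
    # C_out must also equal the majority function AB + AC_in + BC_in.
    assert c_out == (a & b) | (a & c_in) | (b & c_in)
    # The two output bits together encode the arithmetic sum.
    assert 2 * c_out + s == a + b + c_in
    print(a, b, c_in, c_out, s, carry_status(a, b))
```

The loop prints the rows in the same column order as the table (A, B, C_in, C_out, S, status).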

One-Bit Full Adder: Share Logic
An observation: almost always, sum = NOT carry.
C_in  A  B | Sum  C_out
0     0  0 | 0    0
0     0  1 | 1    0
0     1  0 | 1    0
0     1  1 | 0    1
1     0  0 | 1    0
1     0  1 | 0    1
1     1  0 | 0    1
1     1  1 | 1    1
$Sum = A \cdot B \cdot C_{in} + (A + B + C_{in}) \cdot \overline{C_{out}}$
- The term $A \cdot B \cdot C_{in}$ includes the case 111.
- The factor $(A + B + C_{in})$ excludes the case 000.

FA Gate Level Implementations
[Figure: two gate-level full-adder implementations with inputs A, B, C_in, intermediate terms t0, t1, t2, and outputs C_out and S]

Ripple Carry Adder (RCA)
[Figure: four full adders chained through their carries, with inputs A_3..A_0 and B_3..B_0, C_0 = C_in, outputs S_3..S_0, and C_out = C_4]
$T_{adder} \approx T_{FA}(A, B \to C_{out}) + (N-2)\,T_{FA}(C_{in} \to C_{out}) + T_{FA}(C_{in} \to S)$
$t_{adder} \approx (N-1)\,t_{carry} + t_{sum}$
T = O(N) worst-case delay.
Real goal: make the fastest possible carry path.
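A behavioral sketch of the ripple-carry structure, together with the linear delay estimate from this slide. The function names and the unit delays used as defaults are assumptions for illustration only.

```python
def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    p = a ^ b
    return p ^ c_in, (a & b) | (p & c_in)

def ripple_carry_add(a: int, b: int, n: int, c_in: int = 0) -> tuple[int, int]:
    """Add two n-bit values by chaining n full adders; returns (sum, carry_out)."""
    carry, total = c_in, 0
    for i in range(n):                       # bit 0 first: the carry ripples upward
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry

def rca_delay(n: int, t_carry: float = 1, t_sum: float = 1) -> float:
    """Worst-case estimate t_adder ~ (N-1)*t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum

print(ripple_carry_add(77, 71, 8))
print(rca_delay(64))
```

The worst case occurs when a carry created at bit 0 propagates through every remaining stage, for example adding 1 to an all-ones operand.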

Complementary Static CMOS Full Adder
[Figure: transistor-level schematic of the full adder, with internal node X and outputs C_o and S]
$C_{out} = AB + BC_{in} + AC_{in} = AB + C_{in}(A + B)$
$SUM = ABC_{in} + \overline{C_{out}}\,(A + B + C_{in})$
28 transistors.

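Inversion Property
Inverting all inputs to a FA results in inverted values for all outputs.
[Figure: two FA cells, one with true inputs and outputs, the other with all inputs and outputs inverted]
$\overline{S}(A, B, C_{in}) = S(\overline{A}, \overline{B}, \overline{C_{in}})$
$\overline{C_{out}}(A, B, C_{in}) = C_{out}(\overline{A}, \overline{B}, \overline{C_{in}})$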
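Because the full adder has only eight input combinations, the inversion property can be confirmed exhaustively. The helper name `inversion_property_holds` is illustrative, not from the slides.

```python
from itertools import product

def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def inversion_property_holds() -> bool:
    """Check that inverting every input inverts both outputs."""
    for a, b, c in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c)
        s_inv, c_out_inv = full_adder(1 - a, 1 - b, 1 - c)
        if (s_inv, c_out_inv) != (1 - s, 1 - c_out):
            return False
    return True

print(inversion_property_holds())  # True
```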

One-Bit Full Adder: Inverted Inputs
An observation: invert inputs => outputs invert.
Exploit this property: get rid of the inverter on the carry critical path.
C_in  A  B | Sum  C_out
0     0  0 | 0    0
0     0  1 | 1    0
0     1  0 | 1    0
0     1  1 | 0    1
1     0  0 | 1    0
1     0  1 | 0    1
1     1  0 | 0    1
1     1  1 | 1    1

Exploiting the Inversion Property
[Figure: 4-bit ripple-carry chain (A_3..A_0, B_3..B_0, C_0 = C_in, C_out = C_4, S_3..S_0) alternating inverted cells and regular cells]
Minimizes the critical path (the carry chain) by eliminating inverters between the FAs. The transistor sizing on the carry chain portion of the mirror adder will need to be increased.

Ripple Carry Adder: Inverting Property
[Figure: 4-bit ripple-carry chain with carries C_0 to C_4, alternating FA and FA' cells]
- FA' is similar to FA, but with no inverters on the outputs
- Much faster (1-stage)
- Disadvantage: the datapath is not regular

Mirror Adder
24 + 4 transistors.
[Figure: mirror adder schematic with annotated transistor sizes, showing the kill, generate, 0-propagate, and 1-propagate paths and the nodes $\overline{C_{out}}$ and $\overline{S}$]
$C_{out} = AB + BC_{in} + AC_{in} = AB + C_{in}(A + B)$
$SUM = ABC_{in} + \overline{C_{out}}\,(A + B + C_{in})$
Sizing: each input in the carry circuit has a logical effort of 2, so the optimal fan-out for each is also 2. Since $\overline{C_{out}}$ drives 2 internal and 2 inverter transistor gates (to form $C_{in}$ for the next more significant bit adder), the carry circuit should be oversized. PMOS/NMOS ratio of 2.

Mirror Adder Features
- The NMOS and PMOS chains are completely symmetrical, with a maximum of two series transistors in the carry circuitry. This guarantees identical rise and fall transitions if the NMOS and PMOS devices are properly sized.
- When laying out the cell, the most critical issue is minimizing the capacitance at node $\overline{C_{out}}$ (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances). Shared diffusions can reduce the stack node capacitances.
- The transistors connected to $C_{in}$ are placed closest to the output.
- Only the transistors in the carry stage have to be optimized for speed. All transistors in the sum stage can be minimal size.

A 64-bit Adder/Subtractor
- Ripple Carry Adder (RCA) built out of 64 FAs.
- Subtraction: complement all subtrahend bits (XOR gates) and set the low-order carry-in.
- RCA advantage: simple logic, so small (low cost).
- RCA disadvantage: slow (O(N) for N bits) and lots of glitching (so lots of energy consumption).
[Figure: 64 1-bit FAs chained from C_0 = C_in to C_64 = C_out, with inputs A_0..A_63, inputs B_0..B_63 XORed with the add/subt control, and outputs S_0..S_63]
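The subtraction trick on this slide (XOR each subtrahend bit with the control signal and feed the same signal into the low-order carry-in) can be sketched behaviorally as follows. The function name `add_sub` and its signature are assumptions for illustration.

```python
def add_sub(a: int, b: int, n: int, subtract: bool) -> tuple[int, int]:
    """n-bit ripple-carry adder/subtractor; returns (result mod 2**n, carry_out)."""
    ctl = 1 if subtract else 0
    carry = ctl                         # low-order carry-in is set when subtracting
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ ctl      # XOR gate conditionally complements B
        s = a_i ^ b_i ^ carry
        carry = (a_i & b_i) | (carry & (a_i ^ b_i))
        result |= s << i
    return result, carry

print(add_sub(100, 58, 64, subtract=True))
print(add_sub(77, 71, 8, subtract=False))
```

With `subtract=True` the circuit computes A + NOT(B) + 1, which is A - B in two's complement; a carry-out of 1 then means no borrow occurred.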

Carry-Lookahead Adder: Idea
New look: carry propagation.
Idea: try to predict C_k earlier than T_c * k. Instead of passing through k stages, compute C_k separately using 1-stage CMOS logic.
Carry propagation, an example:
Bit position   7 6 5 4 3 2 1 0
Carry          1 0 0 1 1 1 1
A              0 1 0 0 1 1 0 1
B            + 0 1 0 0 0 1 1 1
Sum            1 0 0 1 0 1 0 0

Carry-Lookahead Adder (CLA): One Bit
What happens to the propagating carry in bit position k?
A  B  C_in | C_out
0  0  -    | 0 (kill)
0  1  C    | C (propagate)
1  0  C    | C (propagate)
1  1  -    | 1 (generate)
$p = A + B$ (or $A \oplus B$)
$g = A \cdot B$
[Figure: carry circuit showing the kill, generate, 0-propagate, and 1-propagate paths] [Rab96] p391

CLA: Propagation Equations
If $C_4 = 1$, then either:
- $g_3$: generated at bit position 3
- $g_2 \cdot p_3$: generated at bit position 2, propagated through 3
- $g_1 \cdot p_2 \cdot p_3$: generated at bit position 1, propagated through 2, 3
- $g_0 \cdot p_1 \cdot p_2 \cdot p_3$: generated at bit position 0, propagated through 1, 2, 3
- $C_{in} \cdot p_0 \cdot p_1 \cdot p_2 \cdot p_3$: input carry, propagated through 0, 1, 2, 3
$C_4 = g_3 + g_2 p_3 + g_1 p_2 p_3 + g_0 p_1 p_2 p_3 + C_{in} p_0 p_1 p_2 p_3$
Implement $C_4$ as one-stage CMOS logic: delay = 1 (or is it?)
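A small sketch that evaluates the flat lookahead equation for C_4 and compares it with a rippled carry. The slide only writes out C_4; the expressions for C_1 to C_3 below follow the same pattern and are an extension assumed here, as are the function names.

```python
def cla4_carries(a: int, b: int, c_in: int = 0) -> list[int]:
    """Return [C1, C2, C3, C4] of a 4-bit block from flat lookahead equations."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(4)]  # p = A + B (XOR also works)
    c1 = g[0] | (c_in & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c_in & p[0] & p[1])
    c3 = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2]) | (c_in & p[0] & p[1] & p[2])
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3])
          | (c_in & p[0] & p[1] & p[2] & p[3]))
    return [c1, c2, c3, c4]

def ripple_carries(a: int, b: int, c_in: int = 0) -> list[int]:
    """Reference: the same four carries computed one stage at a time."""
    carries, c = [], c_in
    for i in range(4):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        c = (a_i & b_i) | (c & (a_i | b_i))
        carries.append(c)
    return carries

# The lookahead and ripple carries agree for every 4-bit input combination.
assert all(cla4_carries(a, b, c) == ripple_carries(a, b, c)
           for a in range(16) for b in range(16) for c in (0, 1))
print(cla4_carries(0b1010, 0b1101))
```

Each carry depends only on the g, p, and C_in inputs, so all four can be evaluated in parallel; the cost is the wide gates whose delay the "(or is it?)" remark questions.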

CLA: Static Logic Implementation
[Figure: static CMOS circuit computing C_4 from g_0..g_3, p_0..p_3, and C_in]

CLA: Dynamic Logic Implementation
Dynamic gate implementation:
$C_4 = g_3 + p_3 (g_2 + p_2 (g_1 + p_1 (g_0 + p_0 C_{in})))$
6 transistors in series.
[Figure: dynamic gate clocked by φ, with a series chain of p_3, p_2, p_1, p_0, C_in devices and g_3..g_0 pull-down branches, producing C_4] [WE92] p529 [Hauck]
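As a worked check, the nested form used for the dynamic gate expands step by step into the five-term sum of products for C_4. The derivation below uses only distributivity.

```latex
\begin{align*}
C_4 &= g_3 + p_3\bigl(g_2 + p_2(g_1 + p_1(g_0 + p_0 C_{in}))\bigr) \\
    &= g_3 + p_3 g_2 + p_3 p_2\bigl(g_1 + p_1(g_0 + p_0 C_{in})\bigr) \\
    &= g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 (g_0 + p_0 C_{in}) \\
    &= g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 + p_3 p_2 p_1 p_0 C_{in}
\end{align*}
```

The nested form shares the p transistors between terms, which is why the series stack in the dynamic gate grows by only one device per bit.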

CLA: Dynamic Logic Implementation
Can we reuse logic? Can we get C_1, C_2, and C_3 from the same circuit?
[Figure: the dynamic C_4 gate with candidate taps for C_1, C_2, and C_3 on its internal nodes]
No! C_1, C_2, and C_3 may be floating (not precharged): charge sharing problem. [Hauck]

CLA: Dynamic Logic Implementation
[Figure: four separate dynamic gates, each clocked by φ, computing C_1, C_2, C_3, and C_4 from the g, p, and C_in signals] [WE92] p529

CLA: Basic Block (4 Bits) Architecture
Block of 4-bit p, g, C_out.
[Figure: four p,g cells on inputs A_3 B_3 to A_0 B_0 produce p_3 g_3 to p_0 g_0; a carry generator takes these together with C_0 and produces C_1 to C_4; sum outputs are S_3 to S_0]

CLA: N-Bit Architecture
Put it all together:
[Figure: eight p,g cells on inputs A_7 B_7 to A_0 B_0 feeding two 4-bit carry generators chained through C_4, with carry-in C_0, carry-out C_8, and sum outputs S_7 to S_0]

CLA: 12-Bit Example
A = 1101 1001 1010
B = 0111 0110 1101
[Figure: twelve p,g cells on inputs A_11 B_11 to A_0 B_0 feeding three 4-bit carry generators, with C_0 = 0 and block carries C_4, C_8, C_12]
Block carry-out and sum bits over time:
Time | C_12  S_11..S_8 | C_8  S_7..S_4 | C_4  S_3..S_0
T=0  | 0     0000      | 0    0000     | 0    0000
T=2  | 1     0100      | 0    1111     | 1    0111
T=3  | 1     0100      | 1    0000     | 1    0111
T=4  | 1     0101      | 1    0000     | 1    0111

Summary: Carry Lookahead Adder
CLA compared to ripple-carry adder:
- Faster (about 4 times?), but delay is still linear in the number of bits
- Larger area:
  - P, G signal generation
  - Carry generation circuits
  - Carry generation circuit for each bit position (no reuse)
- Limitation: cannot go beyond 4 bits of look-ahead; large p, g fan-out slows down carry generation
Next: Manchester carry chains, which try to reuse logic by precharging each carry position.

Recap: Carry Look-Ahead
Charge sharing problem.
[Figure: the dynamic C_4 gate clocked by φ, with candidate taps for C_1, C_2, and C_3 on its internal nodes]

Fast Carry Chain Design
The key to fast addition is a low-latency carry network. What matters is whether, in a given position, a carry is
- generated: $G_i = A_i \,\&\, B_i = A_i B_i$
- propagated: $P_i = A_i \oplus B_i$ (sometimes use $A_i + B_i$)
- annihilated (killed): $K_i = \overline{A_i} \,\&\, \overline{B_i}$
Giving a carry recurrence of $C_{i+1} = G_i + P_i C_i$:
C_1 =
C_2 =
C_3 =
C_4 =
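Unrolling the recurrence $C_{i+1} = G_i + P_i C_i$ for the first four positions gives the expansion below. It is a worked sketch that takes $C_0$ as the carry into bit 0, which is an assumption about the indexing.

```latex
\begin{align*}
C_1 &= G_0 + P_0 C_0 \\
C_2 &= G_1 + P_1 C_1 = G_1 + P_1 G_0 + P_1 P_0 C_0 \\
C_3 &= G_2 + P_2 C_2 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0 \\
C_4 &= G_3 + P_3 C_3 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0
\end{align*}
```

Each carry is a function of the G and P signals and $C_0$ only, so no carry has to wait for the one before it.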

Manchester Carry Chain
Switches controlled by G_i and P_i.
[Figure: one chain stage between $\overline{C_i}$ and $\overline{C_{i+1}}$ with switches controlled by G_i, P_i, and clk]
Total delay is the sum of:
- time to form the switch control signals G_i and P_i
- setup time for the switches
- signal propagation delay through N switches in the worst case

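Domino Manchester Carry Chain Circuit
[Figure: four-stage domino Manchester carry chain with pass transistors P_0..P_3, pull-downs G_0..G_3, clk-controlled precharge and evaluate devices, carry input C_{i,0}, carry output C_{i,4}, and annotated transistor sizes]
Chain node values, from the least significant stage upward:
$\overline{G_0 + P_0 C_{i,0}}$
$\overline{G_1 + P_1 G_0 + P_1 P_0 C_{i,0}}$
$\overline{G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_{i,0}}$
$\overline{G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_{i,0}}$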

Carry-Skip (Carry-Bypass) Adder
[Figure: four full adders (A_3 B_3 to A_0 B_0, sums S_3 to S_0) rippling from C_{i,0} to C_{o,3}, with a bypass path selected by BP]
$BP = P_0 P_1 P_2 P_3$ (block propagate)
If $P_0 \,\&\, P_1 \,\&\, P_2 \,\&\, P_3 = 1$ then $C_{o,3} = C_{i,0}$; otherwise the block itself kills or generates the carry internally.
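The bypass rule can be modeled with a short behavioral sketch of one block. The name `carry_skip_block` and its return tuple are illustrative assumptions; the model uses the XOR form of P so that an all-propagate block neither generates nor kills a carry.

```python
def carry_skip_block(a: int, b: int, c_in: int, width: int = 4):
    """One carry-skip block; returns (sum_bits, carry_out, block_propagate)."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    carry, total = c_in, 0
    for i in range(width):                 # ordinary ripple inside the block
        total |= (p[i] ^ carry) << i
        carry = g[i] | (p[i] & carry)
    bp = all(p)                            # BP = P0 & P1 & P2 & P3
    c_out = c_in if bp else carry          # bypass multiplexer
    return total, c_out, bp

print(carry_skip_block(0b1010, 0b0101, 1))
print(carry_skip_block(0b0011, 0b0001, 0))
```

Functionally the bypass never changes the result, since an all-propagate block would ripple C_in to its output anyway. The benefit is purely in timing: the carry-out is available after one multiplexer delay instead of a full ripple through the block.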

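Carry-Skip Chain Implementation
[Figure: carry chain with pass devices P_0..P_3 and pull-downs G_0..G_3 between C_in and $\overline{C_{out}}$, plus a bypass path controlled by BP that selects between the carry-out and the block carry-in to form the block carry-out]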

4-bit Block Carry-Skip Adder
[Figure: four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with setup, carry propagation, and sum stages, chained from C_{i,0}]
Worst-case delay, carry from bit 0 to bit 15: the carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups, and ripples in the last group from bit 12 to bit 15. With B the group size in bits:
$T_{add} = t_{setup} + B\,t_{carry} + ((N/B) - 1)\,t_{skip} + B\,t_{carry} + t_{sum}$

Optimal Block Size and Time
Assuming one stage of ripple ($t_{carry}$) has the same delay as one skip logic stage ($t_{skip}$) and both are 1:
$T_{CSkA} = 1 + B + (N/B - 1) + B + 1 = 2B + N/B + 1$
The five terms are, in order: $t_{setup}$, ripple in block 0, skips, ripple in the last block, $t_{sum}$.
So the optimal block size B follows from $dT_{CSkA}/dB = 2 - N/B^2 = 0$:
$B_{opt} = \sqrt{N/2}$
And the optimal time is
$T_{CSkA,opt} = 2\sqrt{2N} + 1$
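The optimum can be checked numerically under the same unit-delay assumption. The function names below are illustrative, and the brute-force search over integer block sizes is an addition for comparison, not part of the slide.

```python
import math

def carry_skip_delay(n: int, b: float) -> float:
    """T_CSkA = 2B + N/B + 1 with unit t_setup, t_carry, t_skip, and t_sum."""
    return 2 * b + n / b + 1

def optimal_block(n: int) -> tuple[float, float]:
    """Closed-form optimum: B_opt = sqrt(N/2), T_opt = 2*sqrt(2N) + 1."""
    return math.sqrt(n / 2), 2 * math.sqrt(2 * n) + 1

for n in (8, 16, 32, 64):
    b_opt, t_opt = optimal_block(n)
    # Best integer block size, found by brute force for comparison.
    best_b = min(range(1, n + 1), key=lambda b: carry_skip_delay(n, b))
    print(n, round(b_opt, 2), round(t_opt, 2), best_b, carry_skip_delay(n, best_b))
```

For N = 32 the closed form gives exactly B = 4 and T = 17. For other widths the real-valued optimum must be rounded to an integer block size, so the achievable delay is slightly above the closed-form value.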

Carry-Skip Adder Extensions
- Variable block sizes: a carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks, so the inner blocks can be bigger without increasing the overall delay.
- Multiple levels of skip logic: skip level 2 uses the AND of the first-level skip signals (BPs).
[Figure: two block diagrams running from C_in to C_out, one with variable block sizes and one with skip level 1 and skip level 2]

Carry-Skip Adder Comparisons
[Plot: RCA, CSkA with block sizes B = 2 to B = 6, and VSkA versus adder width (8, 16, 32, 48, 64 bits); vertical axis from 0 to 70]

Carry Select Adder
Precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel) and then select the correct one.
[Figure: 4-bit block with a setup stage (inputs A's and B's, outputs P's and G's), a "0" carry propagation path, a "1" carry propagation path, a multiplexer selected by C_in that produces C_out and the C's, and sum generation producing the S's]
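A behavioral sketch of the select scheme: every block computes both candidate results, and the incoming carry only drives a multiplexer. The names `block_add` and `carry_select_add` are illustrative, and the sketch assumes the width n is a multiple of the block size.

```python
def block_add(a: int, b: int, c_in: int, width: int) -> tuple[int, int]:
    """Add one block with a fixed carry-in; returns (sum_bits, carry_out)."""
    total = a + b + c_in
    return total & ((1 << width) - 1), total >> width

def carry_select_add(a: int, b: int, n: int = 16, block: int = 4, c_in: int = 0):
    """n-bit carry-select adder; returns (sum mod 2**n, carry_out)."""
    mask = (1 << block) - 1
    result, carry = 0, c_in
    for lo in range(0, n, block):
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        sum0, cout0 = block_add(a_blk, b_blk, 0, block)   # "0" carry path
        sum1, cout1 = block_add(a_blk, b_blk, 1, block)   # "1" carry path
        # Multiplexer: the real carry-in picks one precomputed pair.
        s, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= s << lo
    return result, carry

print(carry_select_add(1234, 4321))
print(carry_select_add(0xFFFF, 1))
```

In hardware the two paths of all blocks evaluate simultaneously, so only the chain of multiplexers sits on the critical path between blocks. The sequential loop here models the selection order, not the timing.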

Carry Select Adder: Critical Path
[Figure: 16-bit carry select adder built from four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with setup, "0" carry and "1" carry paths, multiplexer, and sum generation, chained from C_in to C_out]

Carry Select Adder: Critical Path
[Figure: the same 16-bit carry select adder with the critical path annotated: setup 1, carry propagation in the first block +4, each multiplexer +1, sum generation +1]
$T_{add} = t_{setup} + B\,t_{carry} + (N/B)\,t_{mux} + t_{sum}$

Square Root Carry Select Adder
[Figure: 20-bit carry select adder with blocks of increasing size (bits 0 to 1, 2 to 4, 5 to 8, 9 to 13, 14 to 19), each with setup, "0" carry and "1" carry paths, multiplexer, and sum generation, chained from C_in to C_out]

Square Root Carry Select Adder
[Figure: the same 20-bit adder with the critical path annotated: setup 1; carry propagation +2, +3, +4, +5, +6 for the blocks from least to most significant; each multiplexer +1; sum generation +1]
$T_{add} = t_{setup} + 2\,t_{carry} + \sqrt{N}\,t_{mux} + t_{sum}$
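The delay annotations of the linear and square-root carry select adders can be reproduced by counting stages directly. This is a sketch under the unit-delay assumption (setup, one ripple stage, one multiplexer, and sum each cost 1); the function names and the rule "grow each block by one bit" are taken from the figures, not from a stated algorithm.

```python
def linear_select_delay(n: int, b: int) -> int:
    """t_setup + B*t_carry + (N/B)*t_mux + t_sum with unit delays."""
    return 1 + b + n // b + 1

def sqrt_select_blocks(n: int, first: int = 2) -> list[int]:
    """Block sizes first, first+1, ... until at least n bits are covered."""
    sizes, size = [], first
    while sum(sizes) < n:
        sizes.append(size)
        size += 1
    return sizes

def sqrt_select_delay(n: int, first: int = 2) -> int:
    """Setup + ripple through the first block + one mux per block + sum."""
    return 1 + first + len(sqrt_select_blocks(n, first)) + 1

print(linear_select_delay(16, 4))
print(sqrt_select_blocks(20), sqrt_select_delay(20))
print(linear_select_delay(64, 4), sqrt_select_delay(64))
```

Growing each block by one bit lets its ripple time match the arrival of the previous multiplexer output, so the number of multiplexers on the critical path grows roughly with the square root of N instead of linearly.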

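Parallel Prefix Adders (PPAs)
Define the carry operator $\circ$ on (G, P) signal pairs:
$(G'', P'') \circ (G', P') = (G, P)$, where $G = G'' + P''G'$ and $P = P''P'$
[Figure: $\circ$ cell with inputs G'', P'', G', P' and outputs G, P]
$\circ$ is associative, i.e.,
$[(g''', p''') \circ (g'', p'')] \circ (g', p') = (g''', p''') \circ [(g'', p'') \circ (g', p')]$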
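The carry operator is small enough to test exhaustively. The sketch below checks associativity over all (G, P) pairs and shows that the operator is not commutative; the name `carry_op` and the tuple encoding are illustrative assumptions.

```python
from itertools import product

def carry_op(left: tuple[int, int], right: tuple[int, int]) -> tuple[int, int]:
    """(G'', P'') o (G', P') with the more significant pair on the left."""
    g2, p2 = left
    g1, p1 = right
    return (g2 | (p2 & g1), p2 & p1)

pairs = list(product((0, 1), repeat=2))

# Associativity: grouping does not change the result.
associative = all(
    carry_op(carry_op(x, y), z) == carry_op(x, carry_op(y, z))
    for x, y, z in product(pairs, repeat=3)
)
# Commutativity fails: operand order matters.
commutative = all(carry_op(x, y) == carry_op(y, x)
                  for x, y in product(pairs, repeat=2))

print(associative, commutative)  # True False
```

Associativity is what allows the carries to be grouped into a tree; the lack of commutativity means the bit order within each group must be preserved.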

PPA General Structure
Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel:
$(G_0, P_0) \circ (G_1, P_1) \circ (G_2, P_2) \circ \dots \circ (G_{N-2}, P_{N-2}) \circ (G_{N-1}, P_{N-1})$
Since $\circ$ is associative, we can group them in any order, but note that it is not commutative.
Structure:
- P_i, G_i logic (1 unit delay)
- C_i parallel prefix logic tree (1 unit delay per level)
- S_i logic (1 unit delay)
Measures to consider:
- number of cells
- tree cell depth (time)
- tree cell area
- cell fan-in and fan-out
- max wiring length
- wiring congestion
- delay path variation (glitching)

Brent-Kung PPA
[Figure: 16-bit Brent-Kung parallel prefix tree with inputs (G_15, P_15) to (G_0, P_0) and C_in, producing carries C_16 to C_1]

Kogge-Stone PPF Adder
[Figure: 16-bit Kogge-Stone parallel prefix tree with inputs (G_15, P_15) to (G_0, P_0) and C_in, producing carries C_16 to C_1]
$T_{add} = t_{setup} + \log_2 N \cdot t_{\circ} + t_{sum}$
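A behavioral sketch of a Kogge-Stone prefix adder, in which every level combines each position with the one `dist` places below it and `dist` doubles per level. The function name, the way C_in is folded into bit 0, and the returned level count are assumptions made for illustration.

```python
def kogge_stone_add(a: int, b: int, n: int = 16, c_in: int = 0):
    """Return (sum mod 2**n, carry_out, prefix_levels)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    G, P = g[:], p[:]
    # Fold C_in into bit 0 so that the final G[i] is the carry out of bit i.
    G[0] = g[0] | (p[0] & c_in)
    levels, dist = 0, 1
    while dist < n:
        G_next, P_next = G[:], P[:]
        for i in range(dist, n):
            # (G[i], P[i]) o (G[i-dist], P[i-dist])
            G_next[i] = G[i] | (P[i] & G[i - dist])
            P_next[i] = P[i] & P[i - dist]
        G, P = G_next, P_next
        dist *= 2
        levels += 1
    carries = [c_in] + G                 # carries[i] is the carry into bit i
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[n], levels

print(kogge_stone_add(77, 71, 8))
print(kogge_stone_add(0xFFFF, 1))
```

For N = 16 the loop runs four times, matching the $\log_2 N$ term in the delay formula; the price is a large number of cells and long wires at the upper levels.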

More Adder Comparisons
[Plot: RCA, CSkA, VSkA, and KS PPA versus adder width (8, 16, 32, 48, 64 bits); vertical axis from 0 to 70]

Adder Speed Comparisons
[Plot: RCA, MCC, CCSkA, VCSkA, CCSlA, and B&K versus adder width (16, 32, 64 bits); vertical axis from 10 to 70]

Adder Average Power Comparisons
[Plot: average power of RCA, MCC, CCSkA, VCSkA, CCSlA, and B&K versus adder width (16, 32, 64 bits); vertical axis from 0 to 35]

PDP of Adder Comparisons
[Plot: power-delay product (PDP) of RCA, MCCA, CCSkA, VCSkA, CCSlA, and BKA versus adder width (8, 16, 32, 48, 64 bits); vertical axis from 0 to 100]
From Nagendra, 1996