CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 19: Adder Design [Adapted from Rabaey's Digital Integrated Circuits, Second Edition, © 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411 L19 S.1

Major Components of a Computer
[Figure: a computer made up of a Processor (Control, Datapath), Memory, and Devices (Input, Output)]
Modern processor architecture styles (CSE 431)
- Pipelined, single issue (e.g., ARM)
- Pipelined, hardware controlled multiple issue: superscalar
- Pipelined, software controlled multiple issue: VLIW
- Pipelined, multiple issue from multiple process threads: multithreaded

Basic Building Blocks
- Datapath
  - Execution units: adder, multiplier, divider, shifter, etc.
  - Register file and pipeline registers
  - Multiplexers, decoders
- Control
  - Finite state machines (PLA, ROM, random logic)
- Interconnect
  - Switches, arbiters, buses
- Memory
  - Caches, TLBs, DRAM, buffers

MIPS 5-Stage Pipelined (Single Issue) Datapath
[Figure: datapath with stages Fetch, Decode, Execute, Memory, WriteBack, showing the PC, I$, register file, sign extend (16 to 32 bits), shift left 2, adders, ALU, D$, and the IF/Dec, Dec/Exec, Exec/Mem, and Mem/WB pipeline stage isolation registers; control signals include Icache precharge, Dcache precharge, RegWrite, and clk]

Datapath Bit-Sliced Organization
[Figure: bit slices Bit 0 to Bit 3, each containing multiplexer, pipeline register, register file, pipeline register, multiplexer, adder, shifter, and pipeline registers; control flow and data flow are marked, with data arriving from I$ and going to/from D$]
Tile identical bit-slice elements

The Binary Adder
[Figure: full adder with inputs A, B, C_in and outputs Sum, C_out]
S = A ⊕ B ⊕ C_i = A·!B·!C_i + !A·B·!C_i + !A·!B·C_i + A·B·C_i
C_o = AB + BC_i + AC_i

The 1-bit Binary Adder
[Figure: 1-bit Full Adder (FA) with inputs A, B, C_in and outputs S, C_out]
G = A & B
P = A ⊕ B
K = !A & !B

A  B  C_in   C_out  S   carry status
0  0  0      0      0   kill
0  0  1      0      1   kill
0  1  0      0      1   propagate
0  1  1      1      0   propagate
1  0  0      0      1   propagate
1  0  1      1      0   propagate
1  1  0      1      0   generate
1  1  1      1      1   generate

S = A ⊕ B ⊕ C_in = P ⊕ C_in
C_out = A&B | A&C_in | B&C_in = G | P&C_in (majority function)
A VERY common operation, often in the critical path
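The truth table above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `full_adder` and `carry_status` are hypothetical, and the loop checks each row against ordinary integer addition.

```python
from itertools import product

def full_adder(a, b, c_in):
    """Return (s, c_out) for one-bit inputs using generate/propagate signals."""
    g = a & b                  # generate: carry out is 1 regardless of c_in
    p = a ^ b                  # propagate: carry out equals c_in
    s = p ^ c_in
    c_out = g | (p & c_in)     # same value as the majority function
    return s, c_out

def carry_status(a, b):
    """Classify an (a, b) input pair as kill, propagate, or generate."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"

for a, b, c_in in product((0, 1), repeat=3):
    s, c_out = full_adder(a, b, c_in)
    # The arithmetic sum of the three inputs must equal 2*c_out + s.
    assert a + b + c_in == 2 * c_out + s
    print(a, b, c_in, c_out, s, carry_status(a, b))
```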

Complementary Static CMOS Full Adder
A direct implementation in CMOS needs 28 transistors (p. 565)
C_o = AB + BC_i + AC_i
S = ABC_i + !C_o (A + B + C_i)
[Figure: transistor-level schematic of the complementary static CMOS full adder, 28 transistors, with internal node X and outputs C_o and S]

The 1-bit Binary Adder (revisited)
S = A ⊕ B ⊕ C_in = P ⊕ C_in
C_out = A&B | A&C_in | B&C_in = G | P&C_in (majority function)
- How can we use it to build a 64-bit adder?
- How can we modify it easily to build an adder/subtractor?
- How can we make it better (faster, lower power, smaller)?

A 64-bit Adder/Subtractor
Ripple Carry Adder (RCA): built out of 64 FAs
Subtraction: complement all subtrahend bits (XOR gates) and set the low-order carry-in
[Figure: chain of 64 1-bit FAs; stage i takes A_i, B_i (through an XOR gate controlled by add/subt), and C_i and produces S_i and C_{i+1}; C_0 = C_in and C_64 = C_out]
RCA
- advantage: simple logic, so small (low cost)
- disadvantage: slow (O(N) for N bits) and lots of glitching (so lots of energy consumption)
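A bit-level model of the adder/subtractor helps show why one control signal is enough. The sketch below is a behavioral illustration rather than a circuit description; `ripple_add_sub` and its parameters are hypothetical names. The add/subt control both inverts every B bit (the XOR gates) and supplies the low-order carry-in, which together form the two's complement of B.

```python
def ripple_add_sub(a, b, subtract=False, n=64):
    """N-bit ripple carry adder/subtractor; returns (result, carry_out)."""
    ctrl = 1 if subtract else 0
    carry = ctrl                           # low-order carry-in is the control bit
    result = 0
    for i in range(n):                     # one full adder per bit, LSB first
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ ctrl        # XOR gate complements B when subtracting
        p = a_i ^ b_i
        result |= (p ^ carry) << i         # sum bit
        carry = (a_i & b_i) | (p & carry)  # carry ripples to the next stage
    return result, carry

print(ripple_add_sub(5, 3))
print(ripple_add_sub(5, 3, subtract=True))
```

The loop makes the O(N) delay visible: each iteration needs the carry produced by the previous one.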

Ripple Carry Adder (RCA)
[Figure: four FAs in a chain with inputs A_3 B_3 down to A_0 B_0, sums S_3 to S_0, C_0 = C_in, and C_out = C_4]
T_adder ≈ (N-1) T_carry + T_sum
T = O(N) worst case delay
Real Goal: Make the fastest possible carry path

Inversion Property
Inverting all inputs to a FA results in inverted values for all outputs
[Figure: a FA with inputs A, B, C_in and outputs S, C_out, next to the same FA with all inputs and outputs inverted]
!S(A, B, C_in) = S(!A, !B, !C_in)
!C_out(A, B, C_in) = C_out(!A, !B, !C_in)
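With only eight input combinations, the inversion property can be checked exhaustively. The following is a minimal sketch with hypothetical function names; bits are modeled as the integers 0 and 1, so `1 - x` plays the role of `!x`.

```python
from itertools import product

def full_adder(a, b, c_in):
    """Reference full adder: returns (s, c_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def inversion_property_holds():
    """Check that inverting all inputs inverts both outputs, for all 8 cases."""
    for a, b, c in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c)
        s_inv, c_out_inv = full_adder(1 - a, 1 - b, 1 - c)
        if s_inv != 1 - s or c_out_inv != 1 - c_out:
            return False
    return True

print(inversion_property_holds())
```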

Exploiting the Inversion Property
[Figure: four FAs in a chain, alternating inverted cells and regular cells, with inputs A_3 B_3 down to A_0 B_0, sums S_3 to S_0, C_0 = C_in, and C_out = C_4]
- Minimizes the critical path (the carry chain) by eliminating inverters between the FAs
- Now need two flavors of FAs

Mirror Adder
24+4 transistors
[Figure: mirror adder schematic with kill, generate, 0-propagate, and 1-propagate paths producing !C_out, followed by the sum stage producing !S]
C_out = A&B | B&C_in | A&C_in
SUM = A&B&C_in | !C_out & (A | B | C_in)

Mirror Adder Features
- The NMOS and PMOS chains are completely symmetrical, with a maximum of two series transistors in the carry circuitry, guaranteeing identical rise and fall transitions if the NMOS and PMOS devices are properly sized.
- When laying out the cell, the most critical issue is the minimization of the capacitances at node !C_out (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances). Shared diffusions can reduce the stack node capacitances.
- The transistors connected to C_in are placed closest to the output.
- Only the transistors in the carry stage have to be optimized for optimal speed. All transistors in the sum stage can be minimal size.

Fast Carry Chain Design
The key to fast addition is a low latency carry network
What matters is whether in a given position a carry is
- generated: G_i = A_i & B_i
- propagated: P_i = A_i ⊕ B_i (sometimes use A_i | B_i)
- annihilated (killed): K_i = !A_i & !B_i
Giving a carry recurrence of C_{i+1} = G_i | P_i&C_i
C_1 = G_0 | P_0&C_0
C_2 = G_1 | P_1&G_0 | P_1&P_0&C_0
C_3 = G_2 | P_2&G_1 | P_2&P_1&G_0 | P_2&P_1&P_0&C_0
C_4 = G_3 | P_3&G_2 | P_3&P_2&G_1 | P_3&P_2&P_1&G_0 | P_3&P_2&P_1&P_0&C_0
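The expanded expression for C_4 follows from unrolling the recurrence four times. As a hedged sanity check (the function names below are hypothetical), the sketch evaluates both forms for every 4-bit operand pair and carry-in and confirms they agree.

```python
from itertools import product

def carries_recurrence(g, p, c0):
    """Apply C[i+1] = G[i] | (P[i] & C[i]); returns [C_0, C_1, ..., C_N]."""
    c = [c0]
    for g_i, p_i in zip(g, p):
        c.append(g_i | (p_i & c[-1]))
    return c

def c4_expanded(g, p, c0):
    """C_4 written as a flat sum of products, as in the unrolled recurrence."""
    return (g[3]
            | (p[3] & g[2])
            | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c0))

def forms_agree():
    """Compare both forms over all 4-bit A, B and both carry-in values."""
    for bits in product((0, 1), repeat=9):
        a, b, c0 = bits[0:4], bits[4:8], bits[8]
        g = [x & y for x, y in zip(a, b)]
        p = [x ^ y for x, y in zip(a, b)]
        if carries_recurrence(g, p, c0)[4] != c4_expanded(g, p, c0):
            return False
    return True

print(forms_agree())
```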

Manchester Carry Chain (MCC)
Switches controlled by G_i and P_i
[Figure: one clocked MCC stage passing !C_i to !C_{i+1} through switches controlled by G_i and P_i]
Total delay of
- time to form the switch control signals G_i and P_i
- signal propagation delay through N switches in the worst case

4-bit Sliced MCC Adder
[Figure: four bit slices with inputs A_3 B_3 down to A_0 B_0; each slice forms G and P, the clocked carry chain runs from !C_0 through !C_1, !C_2, !C_3 to !C_4, and the sums are S_3 to S_0]

8-bit MCC Adder
[Figure: two 4-bit slice MCC blocks chained, with carry signals labeled !C_0 and !C_7]
It's really hard to beat the speed of a well designed MCC for word lengths of 8 bits or less!

Carry Skip Adder (a.k.a. Carry Bypass Adder)
[Figure: four FAs in a chain with inputs A_3 B_3 down to A_0 B_0, sums S_3 to S_0, carry-in C_0, and a bypass multiplexer selecting the block carry-out C_4]
BP = P_0 & P_1 & P_2 & P_3 (Block Propagate)
If (P_0 & P_1 & P_2 & P_3 = 1) then C_4 = C_0; otherwise the block itself kills or generates the carry internally
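The bypass is safe because, when every bit in the block propagates, the rippled carry-out equals the carry-in anyway. The sketch below (hypothetical function names, bit tuples given LSB first) models a block with and without the bypass and checks that they always produce the same carry-out.

```python
from itertools import product

def ripple_carry_out(a_bits, b_bits, c_in):
    """Carry out of a block when the carry ripples through every bit."""
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        carry = (a & b) | ((a ^ b) & carry)
    return carry

def skip_carry_out(a_bits, b_bits, c_in):
    """Carry out of a block with a bypass path selected by block propagate."""
    bp = all(a ^ b for a, b in zip(a_bits, b_bits))  # BP = P_0 & P_1 & ...
    if bp:
        return c_in                                  # carry skips the block
    return ripple_carry_out(a_bits, b_bits, c_in)    # killed or generated inside

def skip_matches_ripple(block=4):
    """Exhaustively confirm the bypass never changes the logical result."""
    for v in product((0, 1), repeat=2 * block + 1):
        a, b, c_in = v[:block], v[block:2 * block], v[-1]
        if skip_carry_out(a, b, c_in) != ripple_carry_out(a, b, c_in):
            return False
    return True

print(skip_matches_ripple())
```

The bypass changes only the delay, not the function, which is why it can be added to an existing ripple block.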

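Carry-Skip Chain Implementation
[Figure: carry chain with switches controlled by P_0 to P_3 and G_0 to G_3 between the block carry-in C_in and !C_out; the block propagate signal BP selects between the chain carry-out and the bypassed block carry-in to form the block carry-out]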

16-bit, 4-bit Block Carry Skip Adder
[Figure: four blocks (bits 0 to 3, bits 4 to 7, bits 8 to 11, bits 12 to 15), each with Setup, Carry Propagation, and Sum stages; the carry-in is C_i,0]
Worst-case delay: carry from bit 0 to bit 15 = carry generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups (B is the group size in bits), ripples in the last group from bit 12 to bit 15
T_add = t_setup + B t_carry + ((N/B) - 1) t_skip + (B - 1) t_carry + t_sum

Optimal Skip Block Size and Add Time
Assuming one stage of ripple (t_carry) has the same delay as one skip logic stage (t_skip) and both are 1
T_CSkA = 1 + B + (N/B - 1) + (B - 1) + 1 = 2B + N/B
where the five terms are, in order: t_setup, ripple in block 0, skips, ripple in last block, t_sum
So the optimal block size, B, is found from dT_CSkA/dB = 2 - N/B^2 = 0, giving B_opt = √(N/2)
And the optimal time is
Optimal T_CSkA = 4√(N/2) = 2√(2N)
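The unit-delay model is easy to evaluate numerically. The sketch below is an illustration under the slide's assumption that t_setup, t_carry, t_skip, and t_sum are all 1; the function names are hypothetical, and N/B is treated as a real number even when B does not divide N, which is an idealization.

```python
import math

def t_cska(n, b):
    """Unit-delay carry skip time: setup + ripple + skips + ripple + sum."""
    return 1 + b + (n / b - 1) + (b - 1) + 1   # simplifies to 2*b + n/b

def best_block_size(n):
    """Integer block size that minimizes t_cska for an n-bit adder."""
    return min(range(1, n + 1), key=lambda b: t_cska(n, b))

for n in (8, 16, 32, 48, 64):
    b_continuous = math.sqrt(n / 2)            # optimum of 2B + N/B over real B
    b_int = best_block_size(n)
    print(n, round(b_continuous, 2), b_int, round(t_cska(n, b_int), 2))
```

For N = 32 the continuous optimum is an integer (B = 4), so the model gives exactly 2√(2N) = 16 unit delays there; for other widths the integer optimum sits next to √(N/2).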

RCA, Carry Skip Adder Comparison
[Figure: delay (vertical axis, 0 to 70) versus adder width (8, 16, 32, 48, 64 bits) for RCA and CSkA; the CSkA points are labeled with block sizes from B=2 to B=6]

Carry Skip Adder Extensions
Variable block sizes
- A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks, so the inner blocks can be bigger without increasing the overall delay
[Figure: carry skip adder with variable block sizes between C_in and C_out]

Carry Select Adder
Precompute the carry out of each block for both carry_in = 0 and carry_in = 1 (can be done for all blocks in parallel) and then select the correct one
[Figure: 4-bit block with inputs A's and B's feeding Setup (P's, G's), a "0" carry propagation path and a "1" carry propagation path, a multiplexer controlled by C_in that produces C_out and the carries C's, and Sum generation producing S's]
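A behavioral model makes the precompute-then-select idea concrete. This is a sketch only, with hypothetical names (`ripple_block`, `carry_select_add`) and a uniform block size; in hardware the two candidate results of every block are computed in parallel, whereas the loop below simply evaluates them in order.

```python
def ripple_block(a_bits, b_bits, c_in):
    """Ripple through one block (LSB first); returns (sum_bits, c_out)."""
    s, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        p = a ^ b
        s.append(p ^ carry)
        carry = (a & b) | (p & carry)
    return s, carry

def carry_select_add(a, b, c_in=0, n=16, block=4):
    """N-bit carry select adder model; returns (result, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    carry, out = c_in, []
    for lo in range(0, n, block):
        blk_a, blk_b = a_bits[lo:lo + block], b_bits[lo:lo + block]
        # Both candidates depend only on the block's own A and B bits.
        s0, c0 = ripple_block(blk_a, blk_b, 0)
        s1, c1 = ripple_block(blk_a, blk_b, 1)
        # The multiplexer picks the candidate matching the real carry-in.
        s_sel, carry = (s1, c1) if carry else (s0, c0)
        out.extend(s_sel)
    result = sum(bit << i for i, bit in enumerate(out))
    return result, carry

print(carry_select_add(1234, 4321))
print(carry_select_add(0xFFFF, 1))
```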

Carry Select Adder: Critical Path
[Figure: four 4-bit blocks (bits 0 to 3, bits 4 to 7, bits 8 to 11, bits 12 to 15), each with Setup (P's, G's), "0" carry and "1" carry paths, a multiplexer, and Sum generation; the critical path is annotated +1 for setup, +4 for the carry path, +1 for each of the four multiplexers, and +1 for sum generation]
T_add = t_setup + B t_carry + (N/B) t_mux + t_sum

Square Root Carry Select Adder
[Figure: five blocks of increasing size (bits 0 to 1, bits 2 to 4, bits 5 to 8, bits 9 to 13, bits 14 to 19), each with Setup (P's, G's), "0" carry and "1" carry paths, a multiplexer, and Sum generation; the carry paths are annotated +2, +3, +4, +5, and +6, with +1 for setup, +1 per multiplexer, and +1 for sum generation]
T_add = t_setup + 2 t_carry + √(2N) t_mux + t_sum
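The two carry select delay expressions can be compared directly. The sketch below assumes unit delays for every component, which is an assumption for illustration only, and uses hypothetical function names; `sqrt_block_sizes` reproduces the staggered block sizes shown in the figure (2, 3, 4, 5, 6 bits).

```python
import math

def t_linear_select(n, b, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Linear carry select: uniform blocks of b bits, N/B multiplexers in series."""
    return t_setup + b * t_carry + (n / b) * t_mux + t_sum

def t_sqrt_select(n, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    """Square root carry select with a 2-bit first block."""
    return t_setup + 2 * t_carry + math.sqrt(2 * n) * t_mux + t_sum

def sqrt_block_sizes(first=2, stages=5):
    """Each block is one bit wider than the previous one."""
    return [first + i for i in range(stages)]

sizes = sqrt_block_sizes()
print(sizes, sum(sizes))                 # block sizes and total adder width
for n in (16, 32, 64):
    print(n, t_linear_select(n, 4), round(t_sqrt_select(n), 2))
```

Growing each block by one bit lets its carry path finish just as the multiplexer select signal from the previous block arrives, so the number of stages grows with √N rather than N.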

Look-Ahead: Topology
C_{o,k} = f(A_k, B_k, C_{o,k-1}) = G_k + P_k C_{o,k-1}
Expanding Lookahead equations:
C_{o,k} = G_k + P_k (G_{k-1} + P_{k-1} C_{o,k-2})
All the way:
C_{o,k} = G_k + P_k (G_{k-1} + P_{k-1} ( ... + P_1 (G_0 + P_0 C_{i,0})))

LookAhead - Basic Idea
[Figure: bit positions 0 to N-1 with inputs A_0, B_0 through A_{N-1}, B_{N-1}, carry inputs C_{i,0} through C_{i,N-1}, propagate signals P_0 through P_{N-1}, and sums S_0 through S_{N-1}]

Look-Ahead: Topology
C_{o,k} = G_k + P_k (G_{k-1} + P_{k-1} ( ... + P_1 (G_0 + P_0 C_{i,0})))
[Figure: transistor-level look-ahead gate between V_DD and ground computing C_{o,3} from G_0 to G_3, P_0 to P_3, and C_{i,0}]

Logarithmic Look-Ahead Adder
[Figure: a function F of inputs A_0 to A_7 built as a linear chain, with t_p ~ N, and as a tree, with t_p ~ log2(N)]

Carry Lookahead Trees
C_{o,0} = G_0 + P_0 C_{i,0}
C_{o,1} = G_1 + P_1 G_0 + P_1 P_0 C_{i,0}
C_{o,2} = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_{i,0}
        = (G_2 + P_2 G_1) + (P_2 P_1)(G_0 + P_0 C_{i,0})
        = G_{2:1} + P_{2:1} C_{o,0}
Can continue building the tree hierarchically.
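The regrouping of C_{o,2} into a group generate and a group propagate term can be confirmed by brute force. In this sketch the function names are hypothetical, and G and P are treated as free Boolean inputs, so the check covers both the XOR and the OR definition of propagate.

```python
from itertools import product

def c_o2_flat(g, p, c_i0):
    """C_{o,2} as a flat sum of products."""
    return (g[2]
            | (p[2] & g[1])
            | (p[2] & p[1] & g[0])
            | (p[2] & p[1] & p[0] & c_i0))

def c_o2_grouped(g, p, c_i0):
    """C_{o,2} built from the group signals G_{2:1}, P_{2:1} and C_{o,0}."""
    g_21 = g[2] | (p[2] & g[1])      # group generate for bits 2:1
    p_21 = p[2] & p[1]               # group propagate for bits 2:1
    c_o0 = g[0] | (p[0] & c_i0)
    return g_21 | (p_21 & c_o0)

def forms_agree():
    """Compare both forms for every combination of G, P, and carry-in."""
    return all(
        c_o2_flat(v[0:3], v[3:6], v[6]) == c_o2_grouped(v[0:3], v[3:6], v[6])
        for v in product((0, 1), repeat=7)
    )

print(forms_agree())
```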

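Carry Operator
Define the carry operator • on (G, P) signal pairs:
(G'', P'') • (G', P') = (G, P)
where G = G'' | P''&G' and P = P''&P'
[Figure: carry operator cell taking (G'', P'') and (G', P') and producing (G, P)]
• is associative, i.e.,
[(g''', p''') • (g'', p'')] • (g', p') = (g''', p''') • [(g'', p'') • (g', p')]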
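Associativity of the carry operator is what allows the prefix computation to be regrouped into a tree, and it can be verified exhaustively. The sketch below uses a hypothetical function `carry_op`, with the left argument being the more significant (G, P) pair.

```python
from itertools import product

def carry_op(left, right):
    """Combine two (G, P) pairs; `left` covers the more significant bits."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    # The group generates if the high part generates, or if the high part
    # propagates a carry generated by the low part.
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def is_associative():
    """Check (x op y) op z == x op (y op z) for all 64 triples of pairs."""
    pairs = list(product((0, 1), repeat=2))
    return all(
        carry_op(carry_op(x, y), z) == carry_op(x, carry_op(y, z))
        for x, y, z in product(pairs, repeat=3)
    )

print(is_associative())
print(carry_op((0, 1), (1, 0)), carry_op((1, 0), (0, 1)))
```

The second print line shows that the operator is not commutative, so a tree may regroup the pairs but must keep them in bit order.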

PPA (Parallel Prefix Adder) General Structure
Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel
(G_0, P_0) • (G_1, P_1) • (G_2, P_2) • ... • (G_{N-2}, P_{N-2}) • (G_{N-1}, P_{N-1})
Since • is associative, we can group them in any order
[Figure: three layers: P_i, G_i logic (1 unit delay); C_i parallel prefix logic tree (1 unit delay per level); S_i logic (1 unit delay)]
Measures to consider
- number of • cells
- tree cell depth (time)
- tree cell area
- cell fan-in and fan-out
- max wiring length
- wiring congestion
- delay path variation (glitching)

Brent-Kung PPA
[Figure: 16-bit Brent-Kung prefix tree with inputs (G_0, P_0) through (G_15, P_15) and outputs C_1 through C_16]

A Faster Yet PPA
- Brent-Kung (BK) adder has the time bound of T_BK = 1 + (2 log2(N) - 2) + 1
- There are even faster PPA approaches that are used in most modern day machines for operands of 32 bits or greater
- Kogge-Stone (KS)
  - faster PP tree (log2(N) for KS versus 2 log2(N) - 2 for BK)
  - fan-out of carry cell limited to two
  - takes more cells and has more wiring

Kogge-Stone PPA
[Figure: 16-bit Kogge-Stone prefix tree with inputs (G_0, P_0) through (G_15, P_15) and C_in, and outputs C_1 through C_16]
T_add = t_setup + log2(N) t_• + t_sum
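A Kogge-Stone style prefix computation can be modeled level by level. The sketch below is a behavioral illustration with hypothetical names, not a description of the circuit in the figure: at each level every position combines with the one `dist` places below it, and `dist` doubles, so N inputs need log2(N) levels. Folding the carry-in into bit 0 before the tree is one possible way to handle C_in and is an assumption of this sketch.

```python
def carry_op(left, right):
    """Combine two (G, P) pairs; `left` covers the more significant bits."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def kogge_stone_add(a, b, c_in=0, n=16):
    """Return (sum, c_out, levels) using a Kogge-Stone style prefix tree."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    nodes = list(zip(g, p))
    # Fold the carry-in into bit 0 so the tree has exactly n inputs.
    nodes[0] = (g[0] | (p[0] & c_in), p[0])
    levels, dist = 0, 1
    while dist < n:
        # All positions are updated from the previous level's values.
        nodes = [carry_op(nodes[i], nodes[i - dist]) if i >= dist else nodes[i]
                 for i in range(n)]
        dist *= 2
        levels += 1
    carries = [c_in] + [g_i for g_i, _ in nodes]   # carries[i] is C_i
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n], levels

print(kogge_stone_add(1234, 4321))
print(kogge_stone_add(0xFFFF, 1))
```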

PPA Comparisons

Measure             BK PPA                      N=64     KS PPA                 N=64
# of cells          2N - 2 - log2(N)            120      N log2(N) - N + 1      321
tree depth          2 log2(N) - 2               10       log2(N)                6
tree area (WxH)     (N/2) * (2 log2(N) - 2)     320      N * log2(N)            384
cell fan-in         2                           2        2                      2
cell fan-out        log2(N)                     6        2                      2
max wire length     N/4                         16       N/2                    32
wiring density      sparse                               dense
glitching           high                                 low
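The numeric columns of the comparison follow directly from the formulas, so they can be recomputed for any power-of-two width. The sketch below uses a hypothetical function `ppa_metrics` and simply evaluates the slide's expressions; for N = 64 the Brent-Kung cell count 2N - 2 - log2(N) evaluates to 120.

```python
import math

def ppa_metrics(n):
    """Evaluate the comparison formulas for an n-bit adder (n a power of two)."""
    lg = int(math.log2(n))
    return {
        "BK": {
            "cells": 2 * n - 2 - lg,
            "depth": 2 * lg - 2,
            "area": (n // 2) * (2 * lg - 2),
            "fan_out": lg,
            "max_wire": n // 4,
        },
        "KS": {
            "cells": n * lg - n + 1,
            "depth": lg,
            "area": n * lg,
            "fan_out": 2,
            "max_wire": n // 2,
        },
    }

for name, row in ppa_metrics(64).items():
    print(name, row)
```

Kogge-Stone halves the tree depth relative to the Brent-Kung bound at the cost of many more cells and longer, denser wiring.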

More Adder Comparisons
[Figure: delay (vertical axis, 0 to 70) versus adder width (8, 16, 32, 48, 64 bits) for RCA, CSkA, and KS PPA]

State of the art

Next Lecture and Reminders
Next lecture: Multiplier Design
- Reading assignment: Rabaey, et al, 11.4