Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1
A Generic Digital Processor MEMORY INPUT-OUTPUT CONTROL DATAPATH 2
Building Blocks for Digital Architectures Arithmetic unit - Bit-sliced datapath (adder, multiplier, shifter, comparator, etc.) Memory - RAM, ROM, Buffers, Shift registers Control - Finite state machine (PLA, random logic.) - Counters Interconnect -Switches -Arbiters -Bus 3
An Intel Microprocessor 9-1 Mux 5-1 Mux a CARRYGEN g64 node1 ck1 SUMSEL REG sum sumb to Cache 9-1 Mux 2-1 Mux b SUMGEN + LU s0 s1 LU : Logical Unit 1000um Itanium has 6 integer execution units like this 4
Bit-Sliced Design Control Bit 3 Data-In Register Adder Shifter Multiplexer Bit 2 Bit 1 Bit 0 Data-Out Tile identical processing elements 5
Bit-Sliced Datapath From register files / Cache / Bypass Multiplexers Shifter Adder stage 1 Loopback Bus Loopback Bus Wiring Adder stage 2 Wiring Loopback Bus Bit slice 63 Adder stage 3 Sum Select Bit slice 2 Bit slice 1 Bit slice 0 To register files / Cache 6
Itanium Integer Datapath Fetzer, Orton, ISSCC 02 7
Adders 8
Full-Adder A B Cin Full adder Sum Cout 9
The Binary Adder A B Cin Full adder Sum Cout S = A B C i = ABC i + ABC i + ABC i + ABC i C o = AB + BC i + AC i 10
Express Sum and Carry as a function of P, G, D Define 3 new variable which ONLY depend on A, B Generate (G) = AB Propagate (P) = A B Delete = A B Can also derive expressions for S and C o based on D and P Note that we will be sometimes using an alternate definition for Propagate (P) = A + B 11
Complimentary Static CMOS Full Adder V DD V DD C i A B A B A B A C i X B C i V DD C i A S C i A B B V DD A B C i A C o B 28 Transistors 12
A Better Structure: The Mirror Adder V DD V DD V DD A "0"-Propagate A C i B B A Kill C o A B C i B C i S "1"-Propagate A Generate C i A B B A B C i A B 24 transistors 13
Mirror Adder Stick Diagram V DD A B C i B A C i C o C i A B C o S GND 14
The Mirror Adder The NMOS and PMOS chains are completely symmetrical. A maximum of two series transistors can be observed in the carrygeneration circuitry. When laying out the cell, the most critical issue is the minimization of the capacitance at node C o. The reduction of the diffusion capacitances is particularly important. The capacitance at node C o is composed of four diffusion capacitances, two internal gate capacitances, and six gate capacitances in the connecting adder cell. The transistors connected to C i are placed closest to the output. Only the transistors in the carry stage have to be optimized for optimal speed. All transistors in the sum stage can be minimal size. 15
Transmission Gate Full Adder P V DD A V DD A A P C i C i P S Sum Generation V DD B A P B A P P V DD C o Carry Generation C i C i Setup A C i P 16
One-phase dynamic CMOS adder 17
One-phase dynamic CMOS adder 18
One-phase dynamic CMOS adder 19
The Ripple-Carry Adder A 0 B 0 A 1 B 1 A 2 B 2 A 3 B 3 C i,0 C o,0 C o,1 C o,2 C o,3 FA FA FA FA (= C i,1 ) S 0 S 1 S 2 S 3 Worst case delay linear with the number of bits t d = O(N) t adder = (N-1)t carry + t sum Goal: Make the fastest possible carry path circuit 20
Inversion Property A B A B C i FA C o C i FA C o S S 21
Minimize Critical Path by Reducing Inverting Stages Even cell Odd cell A 0 B 0 A 1 B 1 A 2 B 2 A 3 B 3 C i,0 C o,0 C o,1 C o,2 C o,3 FA FA FA FA S 0 S 1 S 2 S 3 Exploit Inversion Property 22
Carry Look-Ahead Adders 23
Carry-Lookahead Adders 24
Carry-Lookahead Adders 25
Look-Ahead: Topology Expanding Lookahead equations: V DD C o k, = G k + P k ( G k 1 + P k 1 C ok 2 ), G 3 G 2 All the way: C o k, = G k + P k ( G k 1 + P k 1 ( + P 1 ( G 0 + P 0 C i0 ))), C i,0 G 1 G 0 C o,3 P 0 P 1 P 2 P 3 26
Manchester Carry Chain V DD P i φ P i V DD C o G i C i Ci C o G i P i D i φ 27
Manchester Carry Chain φ V DD P 0 P 1 P 2 P 3 C 3 C i,0 G 0 G 1 G 2 G 3 φ C 0 C 1 C 2 C 3 28
Manchester Carry Chain Stick Diagram Propagate/Generate Row V DD P i G i φ P i + 1 G i + 1 φ C i C i - 1 C i + 1 GND Inverter/Sum Row 29
Carry-Bypass Adder C i,0 P 0 G 1 P 0 G 1 P 2 G 2 P 3 G 3 C o,0 C o,1 C o,2 FA FA FA FA C o,3 Also called Carry-Skip P 0 G 1 P 0 G 1 P 2 G 2 P 3 G 3 BP=P o P 1 P 2 P 3 C i,0 C o,0 C o,1 C o,2 FA FA FA FA Multiplexer C o,3 Idea: If (P0 and P1 and P2 and P3 = 1) then C o3 = C 0, else kill or generate. 30
Carry-Bypass Adder (cont.) Bit 0 3 Bit 4 7 Bit 8 11 Bit 12 15 Setup t setup Setup t bypass Setup Setup Carry propagation Carry propagation Carry propagation Carry propagation Sum Sum Sum t sum Sum M bits t adder = t setup + M tcarry + (N/M 1)t bypass + (M 1)t carry + t sum 31
Carry Ripple versus Carry Bypass t p ripple adder bypass adder 4..8 N 32
Carry-Select Adder Setup P,G "0" "0" Carry Propagation "1" "1" Carry Propagation C o,k-1 Multiplexer Co,k+3 Sum Generation Carry Vector 33
Carry Select Adder: Critical Path Bit 0 3 Bit 4 7 Bit 8 11 Bit 12 15 Setup Setup Setup Setup 0 0-Carry 0 0-Carry 0 0-Carry 0 0-Carry 1 1-Carry 1 1-Carry 1 1-Carry 1 1-Carry Multiplexer Multiplexer Multiplexer Multiplexer C i,0 C o,3 C o,7 C o,11 C o,15 Sum Generation Sum Generation Sum Generation Sum Generation S 0 3 S 4 7 S 8 11 S 12 15 34
Linear Carry Select Bit 0-3 Bit 4-7 Bit 8-11 Bit 12-15 Setup Setup Setup Setup (1) "0" (1) "0" Carry "0" "0" Carry "0" "0" Carry "0" "0" Carry C i,0 "1" "1" Carry (5) (5) Multiplexer "1" Carry "1" Carry "1" Carry "1" "1" "1" (5) (5) (5) (6) (7) (8) Multiplexer Multiplexer Multiplexer (9) Sum Generation Sum Generation Sum Generation Sum Generation S 0-3 S 4-7 S 8-11 S 12-15 (10) 35
Square Root Carry Select Bit 0-1 Bit 2-4 Bit 5-8 Bit 9-13 Bit 14-19 Setup Setup Setup Setup (1) "0" Carry "0" (1) "0" "0" Carry "0" "0" Carry "0" "0" Carry C i,0 "1" Carry "1" Carry "1" Carry "1" Carry "1" "1" "1" "1" (3) (3) (4) (5) (6) (4) (5) (6) (7) Multiplexer Multiplexer Multiplexer Multiplexer (7) Mux (8) Sum Generation Sum Generation Sum Generation Sum Generation Sum S 0-1 S 2-4 S 5-8 S 9-13 S 14-19 (9) 36
Adder Delays - Comparison 50 t p (in unit delays) 40 30 20 10 Ripple adder Linear select Square root select 0 0 20 40 N 60 37
O Operator Definizione 38
Properties of the O operator 39
Properties of the O operator 40
Group Generate and Propagate 41
Group Generate and Propagate 42
Group Generate and Propagate 43
Look-Ahead - Basic Idea A 0, B 0 A 1, B 1 A N-1, B N-1 C i,0 P 0 C i,1 P 1 C i, N-1 P N-1 S 0 S 1 S N-1 C o k, = fa ( k, B k, C o k ) = G k + P k C o k 1, 1, 44
Logarithmic Look-Ahead Adder A 0 F A 1 A 2 A 3 A 4 A 5 A 6 A 7 A 0 A 1 t p N A 2 A 3 A 4 A 5 A 6 A 7 F t p log 2 (N) 45
Brent-Kung BLC adder 46
Folding of the inverse tree 47
Folding the inverse tree 48
Dense tree with minimum fan-out 49
Dense tree with simple connections 50
Carry Lookahead Trees C o2, = C o1 C o0, = G 0 + P 0 C i, 0, = G 1 + P 1 G 0 + P 1 P 0 C i, 0 G 2 + P 2 G 1 + P 2 P 1 G 0 + P 2 P 1 P 0 C i0 = ( G 2 + P 2 G 1 ) + ( P 2 P 1 )( G 0 + P 0 C i 0 ) = G 2:1 + P 2:1 C o0 Can continue building the tree hierarchically.,,, 51
Tree Adders (A0, B 0 ) (A1, B 1 ) (A2, B 2 ) (A3, B 3 ) (A4, B 4 ) (A5, B 5 ) (A6, B 6 ) (A7, B 7 ) (A8, B 8 ) (A9, B 9 ) (A10, B 10 ) (A11, B 11 ) (A12, B 12 ) (A13, B 13 ) (A14, B 14 ) (A15, B 15 ) S0 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 16-bit radix-2 Kogge-Stone tree 52
Tree Adders (a0, b ) 0 (a1, b ) 1 (a2, b ) 2 (a3, b ) 3 (a4, b ) 4 (a5, b ) 5 (a6, b ) 6 (a7, b ) 7 (a8, b ) 8 (a9, b ) 9 (a10, b ) 10 (a11, b ) 11 (a12, b ) 12 (a13, b ) 13 (a14, b ) 14 (a15, b ) 15 S0 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 16-bit radix-4 Kogge-Stone Tree 53
Sparse Trees (a0, b ) 0 (a1, b ) 1 (a2, b ) 2 (a3, b ) 3 (a4, b ) 4 (a5, b ) 5 (a6, b ) 6 (a7, b ) 7 (a8, b ) 8 (a9, b ) 9 (a10, b ) 10 (a11, b ) 11 (a12, b ) 12 (a13, b ) 13 (a14, b ) 14 (a15, b ) 15 S0 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 16-bit radix-2 sparse tree with sparseness of 2 54
Tree Adders (A0, B ) 0 (A1, B ) 1 (A2, B ) 2 (A3, B ) 3 (A4, B ) 4 (A5, B ) 5 (A6, B ) 6 (A7, B ) 7 (A8, B ) 8 (A9, B ) 9 (A10, B ) 10 (A11, B ) 11 (A12, B ) 12 (A13, B ) 13 (A14, B ) 14 (A15, B ) 15 S0 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 Brent-Kung Tree 55
Example: Domino Adder V DD V DD Clk G i = a i b i Clk P i = a i + b i a i a i b i b i Clk Clk Propagate Generate 56
Example: Domino Adder V DD V DD Clk k P i:i-2k+1 Clk k G i:i-2k+1 P i:i-k+1 P i:i-k+1 G i:i-k+1 P i-k:i-2k+1 G i-k:i-2k+1 Propagate Generate 57
Example: Domino Sum V DD V DD Keeper Clk Clkd Sum Gi:0 Clk S i 0 Clkd Clk Gi:0 S i 1 Clk 58
Adders Summary 59
Multipliers 60
The Binary Multiplication 61
The Binary Multiplication 62
The Binary Multiplication + x 1 0 1 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 0 0 1 1 1 0 Multiplicand Multiplier Partial products Result 63
The Array Multiplier X 3 X 2 X 1 X 0 Y 0 X 3 X 2 X 1 X 0 Y 1 Z 0 HA FA FA HA X 3 X 2 X 1 X 0 Y 2 Z 1 FA FA FA HA X3 X 2 X 1 X 0 Y 3 Z 2 FA FA FA HA Z 7 Z 6 Z 5 Z 4 Z 3 64
The MxN Array Multiplier Critical Path HA FA FA HA FA FA FA HA Critical Path 1 Critical Path 2 FA FA FA HA Critical Path 1 & 2 65
Carry-Save Multiplier HA HA HA HA HA FA FA FA HA FA FA FA HA FA FA HA Vector Merging Adder 66
Multiplier Floorplan X 3 X 2 X 1 X 0 Y 0 Y 1 C S C S C S C S Z 0 HA Multiplier Cell FA Multiplier Cell Y 2 C S C S C S C S Z 1 Vector Merging Cell Y 3 C S C S C S C S Z 2 X and Y signals are broadcasted through the complete array. ( ) C S C S C S C S Z 7 Z 6 Z 5 Z 4 Z 3 67
Wallace-Tree Multiplier Partial products First stage 6 5 4 3 2 1 0 6 5 4 3 2 1 0 Bit position (a) (b) Second stage Final adder 6 5 4 3 2 1 0 6 5 4 3 2 1 0 FA (c) HA (d) 68
Wallace-Tree Multiplier Partial products x 3 y 3 x 3 y 2 x 2 y 2 x 3 y 1 x 1 y 2 x 3 y 0 x 1 y 1 x 2 y 0 x 0 y 1 x 2 y 3 x1 y 3 x 0 y 3 x 2 y 1 x 0 y 2 x 1 y 0 x 0 y 0 First stage HA HA Second stage FA FA FA FA Final adder z 7 z 6 z 5 z 4 z 3 z 2 z 1 z 0 69
Wallace-Tree Multiplier y 0 y 1 y 2 FA C i-1 y 0 y 1 y 2 y 3 y 4 y 5 y 3 FA FA C i FA C i-1 C i C i C i-1 C i-1 y 4 FA C i FA C i-1 C i C i-1 y 5 C i FA FA C S C S 70
Wallace Tree Mult.. Performance 71
Wallace Tree Multiplier Complexity 72
4:2 Adder 73
Eight-input input Tree 74
Architectural comparison of multiplier solutions 75
SPIM Architecture 76
SPIM Pipe Timing 77
SPIM Microphotograph 78
SPIM clock generator circuit 79
Binary Tree Multiplier Performance 80
Binary Tree Multiplier Complexity 81
Multipliers Summary Optimization Goals Different Vs Binary Adder Once Again: Identify Critical Path Other possible techniques - Logarithmic versus Linear (Wallace Tree Mult) - Data encoding (Booth) - Pipelining FIRST GLIMPSE AT SYSTEM LEVEL OPTIMIZATION 82
Multipliers Summary 83
Booth encoding 84
Booth encoding 85
Tree Multiplier with Booth Encoding 86
The f.p. addition algorithm Exponent comparison and swap (if needed) Mantissas aligment Addition Normalization of result Rounding of result 87
The f.p. multiplication algorithm Mantissas multiplication Exponent addition Mantissa normalization and exponent adjusting (if needed) Rounding of result 88
Dividers 89
Iterative Division (Newton-Raphson) 90
Iterative Division (Newton-Raphson) 91
Quadratic Convergence of the Newton Method 92
Properties of the Newton Method Asymptotically quadratic convergence Correction of round-off errors Final multiplication by a generates a round-off problem incompatible with the Standard IEEE-754 93
Iterative Division (Goldschmidt) 94
Iterative division (Goldschmidt) 95
Iterative Division (Goldschmidt) The sequence of x n tends to a/b The convergence is quadratic In its present form, this method is affected by roud-off errors 96
Modified Goldschmidt Algorithm (correction of round-off off errors) 97
The Link between the Newton-Raphson and Goldschmidt Methods 98
The Link between the Newton-Raphson and Goldschmidt Methods 99
Comparison between Newton and Goldschmidt methods The Newton and Goldschmidt methods are essentially equivalent; Both methods exhibit an asymptotically quadratic convergence; Both methods are able to correct roundoff errors; The Goldschmidt methods directly computes the a/b ratio. 100
Shifters 101
The Binary Shifter Right nop Left A i B i A i-1 B i-1 Bit-Slice i... 102
The Barrel Shifter A 3 B 3 Sh1 A 2 B 2 A 1 Sh2 B 1 : Data Wire : Control Wire A 0 Sh3 B 0 Sh0 Sh1 Sh2 Sh3 Area Dominated by Wiring 103
4x4 barrel shifter A 3 A 2 A 1 A 0 Sh0 Sh1 Sh2 Sh3 Width barrel ~ 2 p m M Buffer 104
Logarithmic Shifter Sh1 Sh1 Sh2 Sh2 Sh4 Sh4 A 3 B 3 A 2 B 2 A 1 B 1 A 0 B 0 105
0-77 bit Logarithmic Shifter A 3 Out3 A 2 Out2 A 1 Out1 A 0 Out0 106
ALUs 107
Two-bit MUX 108
Two-bit MUX Truth Table 109
Two-bit Selector Truth Table 110
Carry-chain Truth Table 111
ALU block diagram (Mead-Conway) 112
ALU Operations 113
ALU Operations 114
ALU Operations 115
ALU Operations 116
ALU Operations 117
MIPS-X Instruction Format 118
Pipeline dependencies in MIPS-X 119
Die Photo of MIPS-X 120
MIPS-X Architecture 121
MIPS-X Instruction Cache-miss timing 122
MIPS-X Tag Memory 123
MIPS-X Valid Store Array 124
RAM Sense Amplifier 125
CMOS Dual-port Register Cell 126
Self-timed bit-line write circuit 127
Register bypass logic 128
Schematic of comparator circuit 129
Squash FSM 130
Cache-miss FSM 131