Cost/Performance Tradeoffs: a case study Digital Systems Architecture I. L10 - Multipliers 1
Binary Multiplication x a b n bits n bits EASY PROBLEM: design combinational circuit to multiply tiny (1-, 2-, 3-bit) operands... 2n bits since (2 n -1) 2 < 2 2n HARD PROBLEM: design circuit to multiply BIG (32-bit, 64-bit) numbers We can make big multipliers out of little ones! Engineering Principle: Exploit STRUCTURE in in problem. L10 - Multipliers 2
Making a 2n-bit multiplier using n-bit multipliers Given n-bit multipliers: a H a L a x b = ab n bits n bits 2n bits x b H b L Synthesize 2n-bit multipliers: x ab 4n bits a 2n bits b 2n bits a L b L a L b H a H b L a H b H ab L10 - Multipliers 3
Our Basis: n=1: minimalist starting point Multiplying two 1-bit numbers is pretty simple: a x b = 0 ab Of course, we could start with optimized combinational multipliers for larger operands; e.g. a 1 a 0 b 1 b 0 2 2 2-bit Multiplier 4 c 3 c 2 c 1 c 0 the logic gets more complex, but some optimizations are possible... L10 - Multipliers 4
2n-bit by 2n-bit multiplication: Our induction step: 1. Divide multiplicands into n-bit pieces 2. Form 2n-bit partial products, using n-bit by n-bit multipliers. 3. Align appropriately 4. Add. a L b H REGROUP partial products - 2 additions rather than 3! a H a L x b H b a H b H a L b L L a H b L Induction: we can use the same structuring principle to build a 4n-bit multiplier from our newly-constructed 2n-bit ones... a b L10 - Multipliers 5
Brick Wall view of partial products Making 4n-bit multipliers from n-bit ones: 2 induction steps x a 3 a2 a1 a0 b 3 b 2 b 1 b 0 a 2 b 3 a 3 b 3 a 2 b 2 a 0 b 3 a 1 b 3 a 0 b 2 a 1 b 2 a 0 b 1 a 1 b 1 a 0 b 0 a 3 b 2 a 2 b 1 a 3 b 1 a 2 b 0 a 3 b 0 a 1 b 0 L10 - Multipliers 6
Multiplier Cookbook: Chapter 1 Given problem: x a 3 a2 a1 a0 b 3 b 2 b 1 b 0 Subassemblies: Partial Products Adders Step 1: Form (& arrange) Partial Products: a 0 b 3 a 1 b 3 a 0 b 2 a 2 b 3 a 1 b 2 a 0 b 1 a 3 b 3 a 2 b 2 a 1 b 1 a 0 b 0 a 3 b 2 a 2 b 1 a 1 b 0 MULT ADD a 3 b 1 a 2 b 0 a 3 b 0 Step 2: Sum L10 - Multipliers 7
Performance/Cost Analysis "Order Of" notation: "g(n) is of order f(n)" g(n) = Θ (f(n)) g(n) = Θ(f(n)) if there exist C 2 C 1 > 0, such that for all but finitely many integral n 0 c 1 f(n) g(n) c 2 f(n) Example: n 2 +2n+3 = Θ(n 2 ) since n 2 (n 2 +2n+3) 2n 2 "almost always" g(n) = O(f(n)) Θ(...) implies both inequalities; O(...) implies only the second. Partial Products: Things to Add: Adder Width: Hardware Cost: n 2 2n-2 n? = = = = Θ(n 2 ) Θ(n) Θ(n) Θ(n 2 ) Latency: Ο(n 2 )?? L10 - Multipliers 8
Observations: a 0 b 3 a 1 b 3 a 0 b 2 a 2 b 3 a 1 b 2 a 0 b 1 a 3 b 3 a 2 b 2 a 1 b 1 a 0 b 0 MULT ADD Θ(n 2 ) partial products. Θ(n 2 ) full adders. Hmmm. a 3 b 2 a 2 b 1 a 1 b 0 a 3 b 1 a 2 b 0 a 3 b 0 L10 - Multipliers 9
Repackaging Function Engineering Principle #2: #2: Put Put the the Solution where where the the Problem is. is. Θ(n 2 ) partial products. Θ(n 2 ) full adders. MULT ADD ab 0 3 ab 1 3 ab 0 2 ab 2 3 ab 1 2 ab 0 1 ab 3 3 ab 2 2 1 1 ab ab 3 2 ab 2 1 ab 1 0 ab 3 1 ab 2 0 ab 3 0 ab 0 0 How about n 2 blocks, each doing a little multiplication and a little addition? L10 - Multipliers 10
b 3 b 2 ab 0 3 b 1 ab 1 3 ab 0 2 b 0 ab 2 3 ab 1 2 ab 0 1 ab 3 3 ab 2 2 ab 1 1 ab 0 0 ab 3 2 ab 2 1 ab 1 0 a 0 ab 3 1 ab 2 0 a 1 ab 3 0 a 2 a 3 C i+1 A i FA B i Goal: Array of Identical Multiplier Cells C i C k+2 S k+1 S k Single "brick" of brick-wall array... Forms partial product Adds to accumulating sum along with carry S k+1 S k Necessary Component: Full Adder Takes 2 addend bits plus carry bit. Produces sum and carry output bits. b i a j C k (A+B) i CASCADE to form an n-bit adder. L10 - Multipliers 11
Design of 1-bit multiplier "Brick": b 3 b 2 ab 0 3 ab 1 3 ab 0 2 ab 2 3 ab 1 2 ab 0 1 ab 3 3 ab 2 2 1 1 ab ab 3 2 ab 2 1 ab 1 0 ab 3 1 ab 2 0 ab 3 0 a 2 a 3 ab 0 0 b 1 b 0 a 1 a 0 Brick design: AND gate forms 1x1 product 2-bit sum propagates from top to bottom Carry propagates to left Array Layout: operand bits bused diagonally Carry bits propagate right-to-left Sum bits propagate down C k+2 S k+1 0 FA FA S k+1 S k S k b i a j C k L10 - Multipliers 12
Latency revisited Here s our combinational multiplier: a 3 a 2 a 1 a 0 a0 b3 a1 b3 a2 b3 a3 b3 3 2 2 1 b 3 1 2 0 1 2 2 a1 b1 a0 b0 3 1 a3 b0 0 2 2 0 1 0 b 2 b 1 b 0 What s its propagation delay? Naive (but valid) bound: O(n) addtions O(n) time for each addition Hence O(n 2 ) time required On closer inspection: Propagation only toward left, bottom Hence longest path bounded by length + width of array: O(n+n) = O(n)! L10 - Multipliers 13
Combinational Multiplier: Multiplier Cookbook: Chapter 2 a 3 + a 2 a 1 a 0 a0 b3 a1 b3 a2 b3 a3 b3 3 2 2 1 b 3 1 2 0 1 2 2 a1 b1 a0 b0 3 1 a3 b0 0 2 2 0 1 0 b 2 b 1 b 0 Hardware for n by n bits: Latency: Throughput: Θ(n 2 ) Θ(n) Θ(1/n) L10 - Multipliers 14
Combinational Multiplier: best bang for the buck? Suppose we have LOTS of multiplications. Can we do better from a cost/performance standpoint? PIPELINING L10 - Multipliers 15
The Pipelining Bandwagon... where do I get on? WE HAVE: Pipeline rules - "well formed pipelines" Plenty of registers Demand for higher throughput. What do we do? Where do we define stages? a 3 a 2 a 0 b 3 b a 1 a0 b3 a1 b3 a0 b2 a2 b3 a1 b2 a0 b1 2 b 1 a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a3 b1 a2 b0 a3 b0 b 0 L10 - Multipliers 16
Stupid Pipeline Tricks a0 b3 a2 b3 a1 b3 a1 b2 a0 b2 a0 b1 gottreak that long carry chain! a3 b3 a2 b2 a1 b1 a0 b0 a3 b2 a2 b1 a1 b0 a3 b1 a3 b0 a2 b0 Stages: Θ(n) Clock Period: Θ(n) Hardware cost for n by n bits: Θ(n 2 ) Latency: Θ(n 2 ) Throughput: Θ( 1/n ) L10 - Multipliers 17
Even Stupider Pipeline Tricks a3 b3 a2 b3 a3 b2 a1 b3 a2 b2 a0 b3 a0 b2 1 2 0 1 a1 b1 2 1 a3 b1 3 0 2 0 1 0 0 0 WORSE idea: Doesn t break long combinational paths NOT a well-formed pipeline...... different register counts on alternative paths... data crosses stage boundaries in both directions! Back to basics: what s the point of pipelining, anyhow? L10 - Multipliers 18
Breaking O(n) combinational paths LONG PATHS go down, to left: Break array into diagonal slices Segment every long combinational path a 3 a 2 a3 b3 a 0 b 3 b 2 a a0 b 1 3 a0 b2 a1 b3 a0 b1 a1 b2 a0 b0 a2 b3 a1 b1 a2 b2 a1 b0 a3 b2 a3 b1 a3 b0 2 1 2 0 b 1 b 0 GOAL: Θ (n) stages; Θ (1) clock period! L10 - Multipliers 19
Multiplier Cookbook: Chapter 3 Stages: Clock Period: Hardware cost for n by n bits: Latency: Well-formed pipeline (careful!) Constant (high!) throughput, independently of operand size.... but suppose we don t need the throughput? a 3 Θ (n) Θ (1) Θ (n 2 ) Θ (n) Throughput: Θ (1) a 2 a3 b3 a 0 b 3 b 2 a a0 b 1 3 a0 b2 a1 b3 a0 b1 a1 b2 a0 b0 a2 b3 a1 b1 a2 b2 a1 b0 a3 b2 a3 b1 a3 b0 2 1 2 0 b 1 b 0 L10 - Multipliers 20
Moving down the cost curve... Suppose we have INFREQUENT multiplications... pipelining doesn t help us. Can we do better from a cost/performance standpoint? Hmmm, are all these extras really needed? a 3 a 0 b 2 a 1 a 0 b 3 a 0 b 2 a 2 a 1 b 3 a 0 b 1 a 1 b 2 a 0 b 0 a 2 b 3 a 1 b 1 a 2 b 2 a 1 b 0 a 3 b 3 a 2 b 1 a 3 b 2 a 2 b 0 a 3 b 1 3 0 b 1 b 0 L10 - Multipliers 21
Multiplier Cookbook: Chapter 4 a i b 3 b 2 b 1 ai b3 ai b2 ai b1 ai b0 b 0 Sequential Multiplier: Re-uses a single n-bit slice to emulate each pipeline stage a operand entered serially Lots of details to be filled in... Stages: Clock Period: Hardware cost for n by n bits: Latency: Throughput: 1 Θ (1) (constant!) Θ (n) Θ (n) Θ (1/n) L10 - Multipliers 22
(Ridiculous?) Extremes Dept... Cost minimization: how far can we go? a i b 3 b 2 b 1 i 3 i 2 i 1 i 0 b 0 Suppose we want to minimize hardware (at any cost) Consider bit-serial! Form and add 1-bit partial product per clock Reuse single brick for each bit b j of slice; Re-use slice for each bit of a operand L10 - Multipliers 23
Multiplier Cookbook: Chapter 5 Bit Serial multiplier: Re-uses a single brick to emulate an n-bit slice both operands entered serially O(n 2 ) clock cycles required Needs additional storage (typically from existing registers) a i b 3 b 2 b 1 ai b3 i 2 i 1 i 0 b 0 Stages: Clock Period: Hardware cost for n by n bits: Latency: Throughput: Θ (1 n ) Θ (1) (constant!) Θ (1) +? Θ (n 2 ) Θ (1/n 2 ) L10 - Multipliers 24
Summary: Scheme: $ Latency Thruput Combinational Θ(n 2 ) Θ(n) Θ(1/n) N-pipe Θ(n 2 ) Θ(n) Θ(1) Slice-serial Θ(n) Θ(n) Θ(1/n) Bit-serial Θ(1) * Θ(n 2 ) Θ(1/n 2 ) Lots more multiplier technology: fast adders, Booth Encoding, column compression,... L10 - Multipliers 25